Proof of Concept #3

Merged
newsch merged 6 commits from proof-of-concept into main 2023-06-23 19:50:04 +00:00
11 changed files with 1653 additions and 34 deletions

Cargo.lock (generated; 1089 lines changed)
File diff suppressed because it is too large.

Cargo.toml

@@ -4,10 +4,21 @@ version = "0.0.0"
license = "AGPL-3.0-only"
edition = "2021"
repository = "https://github.com/organicmaps/wikiparser/"
default-run = "om-wikiparser"
# See more keys and their definitions at https://doc.rust-lang.org/cargo/reference/manifest.html
[dependencies]
anyhow = { version = "1.0.71", features = ["backtrace"] }
clap = { version = "4.3.2", features = ["derive"] }
env_logger = "0.10.0"
log = "0.4.18"
once_cell = "1.18.0"
scraper = "0.16.0"
serde = { version = "1.0.163", features = ["derive"] }
serde_json = "1.0.96"
url = "2.3.1"
urlencoding = "2.1.2"
[profile.release]
debug = true
overflow-checks = true

README.md

@@ -1,3 +1,8 @@
# wikiparser
_Extracts articles from [Wikipedia database dumps](https://en.wikipedia.org/wiki/Wikipedia:Database_download) for embedding into the `mwm` map files created by [the Organic Maps generator](https://github.com/organicmaps/organicmaps/blob/master/tools/python/maps_generator/README.md)._
## Usage
[`article_processing_config.json`](article_processing_config.json) should be updated when adding a new language.
biodranik commented 2023-06-22 17:27:08 +00:00 (Migrated from github.com)
Review

... It defines the article's sections that are not important for users and should be removed.

It defines article sections that are not important for users and should be removed.

article_processing_config.json (new file, 44 lines)

@@ -0,0 +1,44 @@
biodranik commented 2023-06-22 17:28:31 +00:00 (Migrated from github.com)
Review

Does it make sense to sort sections by name?

{
  "sections_to_remove": {
    "de": [
      "Anmerkungen",
      "Anmerkungen und Einzelnachweise",
      "Einzelbelege",
      "Einzelnachweise",
      "Filme",
      "Literatur",
      "Siehe auch",
      "Weblinks"
    ],
    "en": [
      "Bibliography",
      "External links",
      "Further reading",
      "References",
      "See also",
      "Sources"
    ],
    "es": [
      "Enlaces externos",
      "Referencias",
      "Véase también",
      "Vínculos de interés"
    ],
    "fr": [
      "Articles connexes",
      "Bibliographie",
      "Lien externe",
      "Liens externes",
      "Notes et références",
      "Références",
      "Voir aussi"
    ],
    "ru": [
      "Библиография",
      "Литература",
      "Примечания",
      "См. также",
      "Ссылки"
    ]
  }
}
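One possible answer to the sorting question above: a small test could enforce alphabetical order of each language's list. This is a hypothetical sketch, not part of this PR; it assumes the config sits at the repository root (as the README link suggests) and that "sorted" means byte-wise/code-point order, which is what `Vec<String>::sort` gives:

```rust
// Hypothetical test (e.g. in tests/config.rs), not part of this PR:
// check that every language's section list in
// article_processing_config.json stays alphabetically sorted.
use std::collections::BTreeMap;

#[test]
fn config_sections_are_sorted() {
    let config: BTreeMap<String, BTreeMap<String, Vec<String>>> =
        serde_json::from_str(include_str!(concat!(
            env!("CARGO_MANIFEST_DIR"),
            "/article_processing_config.json"
        )))
        .expect("config is valid JSON");
    for (lang, sections) in &config["sections_to_remove"] {
        let mut sorted = sections.clone();
        sorted.sort();
        assert_eq!(sections, &sorted, "sections for {lang} are not sorted");
    }
}
```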

benches/id_parsing.rs (new file, 43 lines)

@@ -0,0 +1,43 @@
#![feature(test)]

use std::{collections::HashSet, str::FromStr};

extern crate om_wikiparser;
extern crate test;

#[bench]
fn parse_wikipedia(b: &mut test::Bencher) {
    b.iter(|| {
        let title = om_wikiparser::wm::WikipediaTitleNorm::from_url(
            "https://en.wikipedia.org/wiki/Article_Title",
        )
        .unwrap();
    });
}

#[bench]
fn hash_wikipedia(b: &mut test::Bencher) {
    let title = om_wikiparser::wm::WikipediaTitleNorm::from_url(
        "https://en.wikipedia.org/wiki/Article_Title",
    )
    .unwrap();
    let mut set = HashSet::new();
    b.iter(|| {
        set.insert(&title);
    });
}

#[bench]
fn parse_wikidata(b: &mut test::Bencher) {
    b.iter(|| {
        let qid = om_wikiparser::wm::WikidataQid::from_str("Q123456789").unwrap();
    });
}

#[bench]
fn hash_wikidata(b: &mut test::Bencher) {
    let qid = om_wikiparser::wm::WikidataQid::from_str("Q123456789").unwrap();
    let mut set = HashSet::new();
    b.iter(|| {
        set.insert(&qid);
    });
}
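Note: these benchmarks use the unstable `test` crate (`#![feature(test)]`), so running them requires a nightly toolchain, e.g. `cargo +nightly bench`.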

src/bin/simplify_html.rs (new file, 18 lines)

@@ -0,0 +1,18 @@
//! Apply html article simplification to stdin, and write it to stdout.
//!
//! Usage:
//!     simplify_html < article.html > simplified.html
use std::io::{stdin, stdout, Read, Write};

use om_wikiparser::html::simplify;

fn main() -> anyhow::Result<()> {
    let mut input = String::new();
    stdin().read_to_string(&mut input)?;
    let output = simplify(&input, "en");
    stdout().write_all(output.as_bytes())?;
    Ok(())
}
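A quick way to try this helper from a checkout of the repo is `cargo run --bin simplify_html < article.html > simplified.html`. Note that the language is currently hard-coded to `"en"` here.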

src/html.rs (new file, 92 lines)

@@ -0,0 +1,92 @@
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
use std::collections::{BTreeMap, BTreeSet};
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
use once_cell::sync::Lazy;
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
use scraper::{ElementRef, Html, Selector};
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
use serde::Deserialize;
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
#[derive(Debug, Deserialize)]
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
struct Config<'a> {
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
#[serde(borrow)]
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
sections_to_remove: BTreeMap<&'a str, BTreeSet<&'a str>>,
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
static CONFIG: Lazy<Config<'static>> = Lazy::new(|| {
    serde_json::from_str(include_str!(concat!(
env!("CARGO_MANIFEST_DIR"),
"/article_processing_config.json"
    )))
.expect("\"article_processing_config.json\" is either invalid json or the wrong structure")
});
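Per the compile-time vs. runtime discussion above, a hedged sketch of what an optional runtime override could look like, using the `clap` derive API that is already in the dependencies. The flag name and the `OwnedConfig` type are illustrative, not the in-tree API:

```rust
use std::{
    collections::{BTreeMap, BTreeSet},
    fs,
    path::PathBuf,
};

use clap::Parser;
use serde::Deserialize;

// Owned variant of the config so it can outlive the file it was read from.
#[derive(Deserialize)]
struct OwnedConfig {
    sections_to_remove: BTreeMap<String, BTreeSet<String>>,
}

#[derive(Parser)]
struct Args {
    /// Optional path overriding the compiled-in article_processing_config.json.
    #[arg(long)]
    config: Option<PathBuf>,
}

fn load_config(args: &Args) -> anyhow::Result<OwnedConfig> {
    let text = match &args.config {
        // Runtime: read the file the user pointed at.
        Some(path) => fs::read_to_string(path)?,
        // Default: fall back to the config compiled into the binary.
        None => include_str!(concat!(
            env!("CARGO_MANIFEST_DIR"),
            "/article_processing_config.json"
        ))
        .to_string(),
    };
    Ok(serde_json::from_str(&text)?)
}
```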
static HEADERS: Lazy<Selector> =
    Lazy::new(|| Selector::parse("h1, h2, h3, h4, h5, h6, h7").unwrap());
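A quick usage sketch for the `simplify` function below. The input HTML is hypothetical, and it assumes the `en` entry of the config lists a `References` section:

```rust
let html = r#"<html><body>
    <h2>Description</h2><p>Kept.</p>
    <h2>References</h2><ul><li>Dropped along with its heading.</li></ul>
</body></html>"#;
// Assumes "References" is listed under "en" in sections_to_remove.
let simplified = simplify(html, "en");
assert!(!simplified.contains("References"));
```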
pub fn simplify(html: &str, lang: &str) -> String {
    let mut document = Html::parse_document(html);

    let mut to_remove = Vec::new();

    // Remove configured sections and all trailing elements until next section.
    if let Some(bad_sections) = CONFIG.sections_to_remove.get(lang) {
        for header in document.select(&HEADERS) {
            // TODO: Should this join all text nodes?
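            // (One option: header.text().collect::<String>() joins them, at the cost of an allocation per header.)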
            let Some(title) = header.text().next() else {
                continue
            };
            if bad_sections.contains(&title.trim()) {
                to_remove.push(header.id());
                let header_level = header.value().name();
                // Strip trailing nodes.
                for sibling in header.next_siblings() {
                    if let Some(element) = sibling.value().as_element() {
                        if element.name() == header_level {
                            // TODO: Should this check for a higher level?
                            break;
                        }
                    }
                    to_remove.push(sibling.id());
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
for id in to_remove.drain(..) {
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
if let Some(mut node) = document.tree.get_mut(id) {
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
node.detach();
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
} else {
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
warn!("No sections to remove configured for lang {lang:?}");
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
}
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
// Remove elements with no text that isn't whitespace.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
for element in document
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
.root_element()
biodranik commented 2023-06-08 08:27:41 +00:00 (Migrated from github.com)
Review

Is there a more robust way to exclude some sections for all languages?

Is there a more robust way to exclude some sections for all languages?
newsch commented 2023-06-08 13:06:45 +00:00 (Migrated from github.com)
Review

Do you mean moving this to a configuration file, or something that works independent of the language?

Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names.

We could collapse it into a single set and apply it to all languages.

Do you mean moving this to a configuration file, or something that works independent of the language? Trying to parse the template text could be language independent, but I think that would be less robust than checking the header names. We could collapse it into a single set and apply it to all languages.
biodranik commented 2023-06-08 13:35:24 +00:00 (Migrated from github.com)
Review

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)

No need to collapse, a config looks like a better option (keep adding many other languages in mind, and potential contributors who doesn't read rust code)
newsch commented 2023-06-08 13:51:06 +00:00 (Migrated from github.com)
Review

Do you want to load it at compile time or runtime?

Do you want to load it at compile time or runtime?
newsch commented 2023-06-08 14:39:06 +00:00 (Migrated from github.com)
Review

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.

I added a compile-time json config in 7e6b39a, adding in a flag for loading one at runtime is straightforward if we want to do that later. Using a different config language is also a quick change.
biodranik commented 2023-06-22 17:30:41 +00:00 (Migrated from github.com)
Review
    // Remove sections.

nit: Normal sentences are more readable in many cases. Here and in other places.

```suggestion // Remove sections. ``` nit: Normal sentences are more readable in many cases. Here and in other places.
biodranik commented 2023-06-22 17:34:03 +00:00 (Migrated from github.com)
Review

What's needed to get right answers to these TODOs?

What's needed to get right answers to these TODOs?
biodranik commented 2023-06-22 17:36:08 +00:00 (Migrated from github.com)
Review

Should title be trimmed?

Should title be trimmed?
newsch commented 2023-06-23 16:20:03 +00:00 (Migrated from github.com)
Review

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size.
That's part of the next work.

I need to look through a good sample of the articles and check for things like formatting in headers, tags that are used, and which wikimedia meta tags are responsible for a third of the document size. That's part of the next work.
.descendants()
.filter_map(ElementRef::wrap)
{
if element.text().all(|t| t.trim().is_empty()) {
to_remove.push(element.id());
}
}
for id in to_remove.drain(..) {
if let Some(mut node) = document.tree.get_mut(id) {
node.detach();
}
}
biodranik commented 2023-06-22 21:36:12 +00:00 (Migrated from github.com)
Review

Can copy-paste be avoided?

newsch commented 2023-06-23 15:01:22 +00:00 (Migrated from github.com)
Review

I'll be improving and refactoring this in the next PR; if this bit survives, I'll move it to a function.

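For what it's worth, a sketch of one way the duplicated detach loop could be factored out once the logic settles, assuming `ego_tree` is added as a direct dependency for `NodeId` (scraper's tree type comes from it):

```rust
use ego_tree::NodeId;
use scraper::Html;

/// Detach the given nodes from the document tree, draining the scratch buffer.
fn detach_nodes(document: &mut Html, to_remove: &mut Vec<NodeId>) {
    for id in to_remove.drain(..) {
        if let Some(mut node) = document.tree.get_mut(id) {
            node.detach();
        }
    }
}
```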
document.html()
}
#[cfg(test)]
mod test {
biodranik commented 2023-06-22 21:37:15 +00:00 (Migrated from github.com)
Review

Is it hard to make a simple test for the function above?

newsch commented 2023-06-23 15:00:08 +00:00 (Migrated from github.com)
Review

No, I just didn't bother since I'll be changing this in the next PR. I can add some if you'd like.

use super::*;
#[test]
fn static_config_parses() {
assert!(!CONFIG.sections_to_remove.is_empty());
}
}
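Picking up newsch's offer above, a minimal test for the section-removal behavior might look like the sketch below. simplify is the function from om_wikiparser::html used in main.rs, but the HTML shape and the assumption that the "de" config lists "Einzelnachweise" are mine, not from this PR:

```rust
#[cfg(test)]
mod simplify_tests {
    use om_wikiparser::html::simplify;

    #[test]
    fn removes_configured_section() {
        // Assumption: "Einzelnachweise" is in sections_to_remove for "de".
        let html = r#"<article>
            <section><h2>Geschichte</h2><p>kept</p></section>
            <section><h2>Einzelnachweise</h2><ol><li>dropped</li></ol></section>
        </article>"#;
        let simplified = simplify(html, "de");
        assert!(simplified.contains("Geschichte"));
        assert!(!simplified.contains("Einzelnachweise"));
    }
}
```
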

5
src/lib.rs Normal file
View file

@ -0,0 +1,5 @@
pub mod html;
pub mod wm;
#[macro_use]
extern crate log;

src/main.rs
View file

@ -1,54 +1,118 @@
// Usage:
// pv ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO | cargo run --release > /dev/null
// # prep outputs from map generator
// cut -f 2 ~/Downloads/id_to_wikidata.csv > /tmp/wikidata_ids.txt
// tail -n +2 ~/Downloads/wiki_urls.txt | cut -f 3 > /tmp/wikipedia_urls.txt
// # feed gzipped tarfile
// pv ~/Downloads/enwiki-NS0-20230401-ENTERPRISE-HTML.json.tar.gz | tar xzO \
// | cargo run --release -- \
// --wikidata-ids /tmp/wikidata_ids.txt \
// --wikipedia-urls /tmp/wikipedia_urls.txt \
// output_dir
use std::{
fs::{create_dir, File},
io::{stdin, BufRead, Write},
path::{Path, PathBuf},
};
use serde::Deserialize;
use std::io::{self, stdin, BufRead, BufReader, Write};
use anyhow::bail;
use clap::Parser;
#[macro_use]
extern crate log;
#[derive(Deserialize)]
struct Page {
// TODO: check if CoW has a performance impact
name: String,
date_modified: String,
#[serde(default)]
url: String,
main_entity: Option<Wikidata>,
// TODO: see what impact parsing/unescaping/allocating this has
article_body: ArticleBody,
#[serde(default)]
redirects: Vec<Redirect>,
use om_wikiparser::{
html::simplify,
wm::{is_wikidata_match, is_wikipedia_match, parse_wikidata_file, parse_wikipedia_file, Page},
};
#[derive(Parser)]
struct Args {
output_dir: PathBuf,
#[arg(long)]
wikidata_ids: Option<PathBuf>,
#[arg(long)]
wikipedia_urls: Option<PathBuf>,
}
#[derive(Deserialize)]
struct Wikidata {
identifier: String,
}
fn write(dir: impl AsRef<Path>, page: Page) -> anyhow::Result<()> {
let Some(qid) = page.main_entity.map(|e| e.identifier) else {
// TODO: handle and still write
bail!("Page in list but without wikidata qid: {:?} ({})", page.name, page.url);
};
#[derive(Deserialize)]
struct ArticleBody {
html: String,
}
let mut filename = dir.as_ref().to_owned();
filename.push(qid);
filename.push(&page.in_language.identifier);
filename.set_extension("html");
#[derive(Deserialize)]
struct Redirect {
url: String,
name: String,
debug!("{:?}: {:?}", page.name, filename);
if filename.exists() {
debug!("Exists, skipping");
return Ok(());
}
let subfolder = filename.parent().unwrap();
if !subfolder.exists() {
create_dir(subfolder)?;
}
let html = simplify(&page.article_body.html, &page.in_language.identifier);
let mut file = File::create(&filename)?;
file.write_all(html.as_bytes())?;
Ok(())
}
fn main() -> anyhow::Result<()> {
let dump = BufReader::new(stdin());
env_logger::Builder::new()
.filter_level(log::LevelFilter::Info)
.parse_default_env()
.try_init()?;
// TODO: compare different deserialization methods
// docs warn against using a reader directly, and it's slower than tar can decompress the dump
let args = Args::parse();
info!("Loading urls");
let wikipedia_titles = args
.wikipedia_urls
.map(parse_wikipedia_file)
.transpose()?
.unwrap_or_default();
info!("Loading ids");
let wikidata_ids = args
.wikidata_ids
.map(parse_wikidata_file)
.transpose()?
.unwrap_or_default();
if !args.output_dir.is_dir() {
bail!("output dir {:?} does not exist", args.output_dir)
}
info!("Processing dump");
let dump = stdin().lock();
// TODO: Compare different deserialization methods.
// The docs warn against using a reader directly, and it's slower than tar can decompress the dump.
// let stream = serde_json::Deserializer::from_reader(dump).into_iter::<Page>();
let stream = dump.lines().map(|r| {
r.map_err(anyhow::Error::new)
.and_then(|s| serde_json::from_str::<Page>(&s).map_err(anyhow::Error::new))
});
let mut stdout = io::stdout();
for page in stream {
let page = page?;
writeln!(stdout, "{}", page.name)?;
if !(is_wikidata_match(&wikidata_ids, &page).is_some()
|| is_wikipedia_match(&wikipedia_titles, &page).is_some())
{
continue;
}
if let Err(e) = write(&args.output_dir, page) {
error!("Error writing article: {}", e);
}
}
Ok(())

205
src/wm/mod.rs Normal file
View file

@ -0,0 +1,205 @@
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

> 1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article's JSON has its .in_language.identifier field set to en, and the program writes that HTML to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

> 2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

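To illustrate the doctest style being described — a hypothetical WikidataQid with a normalizing FromStr constructor; the real type and method names in src/wm/mod.rs may differ:

```rust
use std::{num::ParseIntError, str::FromStr};

/// Hypothetical illustration: a Wikidata QID like "Q42", stored as its numeric part.
#[derive(Debug, PartialEq)]
struct WikidataQid(u32);

impl FromStr for WikidataQid {
    type Err = ParseIntError;
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Normalize: accept surrounding whitespace and a leading 'Q' or 'q'.
        let s = s.trim().trim_start_matches(|c| c == 'Q' || c == 'q');
        s.parse().map(WikidataQid)
    }
}

fn main() {
    // Both spellings normalize to the same value.
    assert_eq!("Q42".parse::<WikidataQid>().unwrap(), WikidataQid(42));
    assert_eq!(" q42 ".parse::<WikidataQid>().unwrap(), WikidataQid(42));
    // Error cases are reported to the caller, not swallowed.
    assert!("Wikipedia".parse::<WikidataQid>().is_err());
}
```
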
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
> 1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (a rough sketch of this option appears after this thread).

> 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be OK to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.

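For the third option in the list above — letting the program spawn the decompression itself — here is a rough sketch. Nothing like this is in the PR; the process_dumps helper and its threading layout are hypothetical:

```rust
use std::{
    io::{BufRead, BufReader},
    path::PathBuf,
    process::{Command, Stdio},
    thread,
};

/// Hypothetical helper: decompress each dump with `tar` in a child process
/// and feed the resulting newline-delimited JSON to the same per-line handling.
fn process_dumps(dumps: Vec<PathBuf>) -> anyhow::Result<()> {
    let mut handles = Vec::new();
    for dump in dumps {
        handles.push(thread::spawn(move || -> anyhow::Result<()> {
            let mut child = Command::new("tar")
                .args(["-x", "-z", "-O", "-f"])
                .arg(&dump)
                .stdout(Stdio::piped())
                .spawn()?;
            let reader = BufReader::new(child.stdout.take().unwrap());
            for line in reader.lines() {
                let line = line?;
                // ... deserialize a Page here and pass it to the worker pool ...
                let _ = line;
            }
            child.wait()?;
            Ok(())
        }));
    }
    for handle in handles {
        handle.join().expect("worker thread panicked")?;
    }
    Ok(())
}
```
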
//! Wikimedia types
use std::{collections::HashSet, ffi::OsStr, fs, num::ParseIntError, str::FromStr};
use anyhow::{anyhow, bail, Context};
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this comment).
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

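As a rough sketch of the third option above, the program could spawn the decompressor itself and stream its stdout. The `stream_dump` name, the command layout, and the `handle_line` callback are placeholders for this sketch, not the tool's actual interface:

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

use anyhow::Context;

/// Spawns `decompress_cmd` (e.g. `["gunzip", "-c", "dump.json.gz"]`) and
/// feeds each decompressed line to `handle_line`.
fn stream_dump(
    decompress_cmd: &[&str],
    mut handle_line: impl FnMut(&str) -> anyhow::Result<()>,
) -> anyhow::Result<()> {
    let (cmd, args) = decompress_cmd.split_first().context("empty command")?;
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()
        .context("failed to spawn decompressor")?;
    let stdout = child.stdout.take().context("child stdout not captured")?;
    for line in BufReader::new(stdout).lines() {
        handle_line(&line?)?;
    }
    let status = child.wait()?;
    anyhow::ensure!(status.success(), "decompressor exited with {status}");
    Ok(())
}
```

Several of these could then run from one process, each feeding the same worker pool.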
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.

use url::Url;
mod page;
pub use page::Page;
/// Read from a file of urls on each line.
pub fn parse_wikidata_file(path: impl AsRef<OsStr>) -> anyhow::Result<HashSet<WikidataQid>> {
    let contents = fs::read_to_string(path.as_ref())?;
    contents
        .lines()
.enumerate()
.map(|(i, line)| {
WikidataQid::from_str(line).with_context(|| {
let line_num = i + 1;
format!("bad QID value on line {line_num}: {line:?}")
})
})
.collect()
}
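
On the open question of testing error cases: a hypothetical unit test in the spirit of that discussion might look like the sketch below. Which malformed inputs `WikidataQid::from_str` actually rejects is an assumption here, so treat the inputs as illustrative rather than as the crate's real behavior.

```rust
#[cfg(test)]
mod tests {
    use super::*;
    use std::str::FromStr;

    // Hypothetical error cases; the real parser's rules may differ.
    #[test]
    fn rejects_malformed_qids() {
        for bad in ["", "12345", "Q", "Qabc", "W12345"] {
            assert!(WikidataQid::from_str(bad).is_err(), "{bad:?} should fail");
        }
    }
}
```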
/// Read article titles from a file of urls on each line.
pub fn parse_wikipedia_file(
    path: impl AsRef<OsStr>,
) -> anyhow::Result<HashSet<WikipediaTitleNorm>> {
    let contents = fs::read_to_string(path.as_ref())?;
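    // Tag each parse failure with its 1-based line number so bad entries in
    // the url list are easy to locate (per the logging behavior discussed above).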
    contents
        .lines()
        .enumerate()
        .map(|(i, line)| {
            WikipediaTitleNorm::from_url(line).with_context(|| {
                let line_num = i + 1;
format!("bad wikipedia url on line {line_num}: {line:?}")
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (a sketch follows below).

  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

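A minimal sketch of that last option, assuming the decompression command line is supplied by the caller (the function name and error handling are hypothetical; it only assumes a `gunzip -c`-style command that writes to stdout):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

/// Hypothetical: spawn e.g. `gunzip -c enwiki.json.gz` and feed its stdout
/// to the same line-oriented loop that normally consumes stdin.
fn process_via_subprocess(cmd: &str, args: &[&str]) -> anyhow::Result<()> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    for line in BufReader::new(stdout).lines() {
        let _article_json = line?;
        // ... deserialize the article and hand it to the worker pool ...
    }
    let status = child.wait()?;
    anyhow::ensure!(status.success(), "decompressor exited with {status}");
    Ok(())
}
```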
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
})
})
.collect()
}
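/// Returns the page's main-entity QID if it parses and is one of the requested `ids`
/// (summary inferred from the fragment below).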
pub fn is_wikidata_match(ids: &HashSet<WikidataQid>, page: &Page) -> Option<WikidataQid> {
    let Some(wikidata) = &page.main_entity else { return None; };
    let wikidata_id = &wikidata.identifier;
    let wikidata_id = match WikidataQid::from_str(wikidata_id) {
        Ok(qid) => qid,
        Err(e) => {
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this list).
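
A rough sketch of that third option (the decompressor invocation and the per-line handoff are placeholders, not current wikiparser behavior):

```rust
use std::io::{BufRead, BufReader, Result};
use std::process::{Command, Stdio};

/// Spawn the user-supplied decompressor (e.g. `gunzip -c dump.ndjson.gz`)
/// and stream its stdout line by line; each NDJSON record would be handed
/// to the existing worker pool instead of printed.
fn process_dump(decompressor: &str, dump: &str) -> Result<()> {
    let mut child = Command::new(decompressor)
        .arg("-c") // write decompressed bytes to stdout
        .arg(dump)
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    for line in BufReader::new(stdout).lines() {
        let record = line?;
        println!("read article record of {} bytes", record.len());
    }
    child.wait()?; // reap the child; a robust version would check the exit status
    Ok(())
}

fn main() -> Result<()> {
    // one invocation per language dump; the path is a placeholder
    process_dump("gunzip", "enwiki-latest.ndjson.gz")
}
```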
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be OK to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.

warn!(
"Could not parse QID for {:?}: {:?}: {:#}",
page.name, wikidata_id, e
);
return None;
}
};
ids.get(&wikidata_id).map(|_| wikidata_id)
}
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
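/// Checks a dump page against the set of wanted article titles,
/// returning the normalized title on a match.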
pub fn is_wikipedia_match(
    titles: &HashSet<WikipediaTitleNorm>,
    page: &Page,
) -> Option<WikipediaTitleNorm> {
    match WikipediaTitleNorm::from_title(&page.name, &page.in_language.identifier) {
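        // A page whose title fails to parse is logged and treated as a non-match.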
        Err(e) => warn!("Could not parse title for {:?}: {:#}", page.name, e),
        Ok(title) => {
            if titles.get(&title).is_some() {
                return Some(title);
            }
        }
}
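// A redirect pointing at one of the requested titles counts as a match too.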
for redirect in &page.redirects {
match WikipediaTitleNorm::from_title(&redirect.name, &page.in_language.identifier) {
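// Redirect titles may fail to normalize; log the error and keep checking the rest.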
Err(e) => warn!(
"Could not parse redirect title for {:?}: {:?}: {:#}",
page.name, redirect.name, e
),
Ok(title) => {
if titles.get(&title).is_some() {
return Some(title);
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this list).
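
A rough sketch of that last option, assuming the decompression command is passed in as a program plus arguments (all names here are hypothetical):

```rust
use std::io::{BufReader, Read};
use std::process::{Command, Stdio};
use std::thread;

/// Sketch only: run `cmd args... <dump>` for each dump in parallel and
/// feed every decompressed stream to the same processing function.
fn process_dumps(cmd: &str, args: &[&str], dumps: &[&str]) -> std::io::Result<()> {
    thread::scope(|s| {
        // Spawn one decompressor subprocess per dump, each read by its own thread.
        let workers: Vec<_> = dumps
            .iter()
            .map(|dump| {
                s.spawn(move || -> std::io::Result<()> {
                    let mut child = Command::new(cmd)
                        .args(args)
                        .arg(dump)
                        .stdout(Stdio::piped())
                        .spawn()?;
                    let stdout = child.stdout.take().expect("stdout was piped");
                    handle_dump(BufReader::new(stdout))?; // the existing per-article loop
                    child.wait()?;
                    Ok(())
                })
            })
            .collect();
        // Propagate the first I/O error from any worker.
        workers
            .into_iter()
            .try_for_each(|w| w.join().expect("worker thread panicked"))
    })
}

fn handle_dump(_reader: impl Read) -> std::io::Result<()> {
    Ok(()) // placeholder for the real article-processing loop
}
```

Something like `process_dumps("gunzip", &["-c"], &dump_paths)` would then decompress and process the dumps concurrently, details depending on the archive format.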
> 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.

}
}
}
}
None
}
/// Wikidata QID/Q Number
///
/// See https://www.wikidata.org/wiki/Wikidata:Glossary#QID
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this reply).
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
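As a rough sketch of the third option above, using only the standard library (the decompression command and dump file names are assumptions, and the hand-off to the worker pool is elided):

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Hypothetical dump files; in practice these would come from the CLI.
    for dump in ["enwiki-html.json.gz", "dewiki-html.json.gz"] {
        // Spawn the decompression command and stream its stdout.
        let mut child = Command::new("gzip")
            .args(["-dc", dump])
            .stdout(Stdio::piped())
            .spawn()?;
        let stdout = child.stdout.take().expect("stdout is piped");
        for line in BufReader::new(stdout).lines() {
            let _article_json = line?; // hand each record to the worker pool here
        }
        child.wait()?;
    }
    Ok(())
}
```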

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
///
/// ```
/// use std::str::FromStr;
/// use om_wikiparser::wm::WikidataQid;
///
/// let with_q = WikidataQid::from_str("Q12345").unwrap();
/// let without_q = WikidataQid::from_str(" 12345 ").unwrap();
/// assert_eq!(with_q, without_q);
biodranik commented 2023-06-22 21:47:45 +00:00 (Migrated from github.com)
Review

Does it make sense to test it?

newsch commented 2023-06-23 16:20:34 +00:00 (Migrated from github.com)
Review

Same as [above](https://github.com/organicmaps/wikiparser/pull/3#discussion_r1239061793).
///
/// assert!(WikidataQid::from_str("q12345").is_ok());
/// assert!(WikidataQid::from_str("https://wikidata.org/wiki/Q12345").is_err());
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article JSON has a `.in_language.identifier` field set to `en`, and the program writes that HTML to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files (a sketch of this layout follows this comment).
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more? (A possible unit-test sketch appears at the end of this thread.)
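For reference, the `QXXXXX/$lang.html` layout described above can be expressed as a small path-building helper; `article_path` and its arguments are illustrative, not the crate's actual API:

```rust
use std::path::PathBuf;

/// Build `<base>/<QID>/<lang>.html` for one article, mirroring the layout
/// described in the comment above (e.g. `Q12345/en.html`).
fn article_path(base: &str, qid: &str, lang: &str) -> PathBuf {
    let mut path = PathBuf::from(base);
    path.push(qid);
    path.push(format!("{lang}.html"));
    path
}

fn main() {
    assert_eq!(
        article_path("out", "Q12345", "en"),
        PathBuf::from("out/Q12345/en.html")
    );
}
```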
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review

  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review

  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`; not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool.

  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
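If more coverage is wanted than the doctest provides, the error cases from the surrounding doc comment could also be exercised in a unit-test module — a sketch under the assumption that `WikidataQid` implements `FromStr` as shown in this diff:

```rust
#[cfg(test)]
mod tests {
    use super::WikidataQid;
    use std::str::FromStr;

    /// The same malformed inputs the doctest rejects, checked one by one so
    /// a failure reports the offending input.
    #[test]
    fn rejects_malformed_qids() {
        for bad in ["", "Q", "Article_Title", "https://wikidata.org/wiki/Q12345"] {
            assert!(WikidataQid::from_str(bad).is_err(), "accepted {bad:?}");
        }
    }
}
```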
/// assert!(WikidataQid::from_str("Article_Title").is_err());
/// assert!(WikidataQid::from_str("Q").is_err());
/// assert!(WikidataQid::from_str("").is_err());
/// ```
#[derive(Debug, PartialOrd, Ord, PartialEq, Eq, Hash)]
pub struct WikidataQid(u32);
impl FromStr for WikidataQid {
type Err = ParseIntError;
fn from_str(s: &str) -> Result<Self, Self::Err> {
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review

  1. Is it in English only now?
  2. Does it make sense to test this function?

newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

> 1. Is it in English only now?

Currently it works with any language, but it only processes a single dump at a time. So when it reads an English dump, each article's JSON has an `.in_language.identifier` field set to `en`, and the program writes that HTML to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps fills in the various `QXXXXX/$lang.html` files (see the layout sketch after the function below).
We could extend it to process multiple dumps in parallel, but I don't expect much of a speedup right now.

> 2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but it doesn't check the various error cases (see the test sketch after the function below).
Do you think there should be more?
let s = s.trim();
let s = s.strip_prefix(['Q', 'q']).unwrap_or(s);
u32::from_str(s).map(WikidataQid)
}
}
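On the testing question: the error cases newsch mentions could be covered by a small unit test next to the doctest. A sketch, assuming `WikidataQid` is a tuple struct around a `u32` as the diff suggests; the test module and case list are illustrative, not part of this PR.

```rust
#[cfg(test)]
mod tests {
    use super::WikidataQid;
    use std::str::FromStr;

    #[test]
    fn parses_and_normalizes_qids() {
        // Upper- and lowercase prefixes and surrounding whitespace all normalize.
        for s in ["Q42", "q42", " Q42 "] {
            assert!(matches!(WikidataQid::from_str(s), Ok(WikidataQid(42))));
        }
        // Error cases that the doctest does not currently cover.
        for s in ["", "Q", "Qfoo", "42Q", "https://www.wikidata.org/wiki/Q42"] {
            assert!(WikidataQid::from_str(s).is_err());
        }
    }
}
```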
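And on the multi-run layout newsch describes, writing each dump's language under the article's QID directory might look like this; the function name and signature are illustrative assumptions.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Each run adds its dump's language file under the article's QID directory,
/// e.g. Q42/en.html after the English run and Q42/de.html after the German one.
fn write_article(out_dir: &Path, qid: u32, lang: &str, html: &str) -> io::Result<()> {
    let dir = out_dir.join(format!("Q{qid}"));
    fs::create_dir_all(&dir)?;
    fs::write(dir.join(format!("{lang}.html")), html)
}
```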
/// Normalized Wikipedia article title that can compare:
/// - titles `Spatial Database`
/// - urls `https://en.wikipedia.org/wiki/Spatial_database#Geodatabase`
/// - osm-style tags `en:Spatial Database`
///
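A minimal sketch of the kind of normalization this doc comment describes. The real type's name and fields are not shown in this excerpt, so `TitleNorm` and its constructors are illustrative assumptions; lowercasing the whole title is a simplification (Wikipedia titles are only case-insensitive in the first character), and a real implementation would also percent-decode URL segments.

```rust
/// Illustrative stand-in for the normalized-title type documented above.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
struct TitleNorm(String);

impl TitleNorm {
    /// Normalize a bare title, or the title part of an osm-style `lang:Title` tag.
    fn from_title(title: &str) -> TitleNorm {
        TitleNorm(title.trim().replace(' ', "_").to_lowercase())
    }

    /// Normalize the last path segment of an article URL, dropping any `#fragment`.
    fn from_url(url: &str) -> Option<TitleNorm> {
        let segment = url.rsplit('/').next()?;
        let title = segment.split('#').next()?;
        Some(TitleNorm(title.trim().replace(' ', "_").to_lowercase()))
    }
}

fn main() {
    let from_title = TitleNorm::from_title("Spatial Database");
    let from_tag = TitleNorm::from_title("en:Spatial Database".split_once(':').unwrap().1);
    let from_url =
        TitleNorm::from_url("https://en.wikipedia.org/wiki/Spatial_database#Geodatabase").unwrap();
    // All three input forms compare equal after normalization.
    assert_eq!(from_title, from_url);
    assert_eq!(from_tag, from_url);
}
```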
/// ```
/// use om_wikiparser::wm::WikipediaTitleNorm;
///
/// let title = WikipediaTitleNorm::from_title("Article Title", "en").unwrap();
/// let url = WikipediaTitleNorm::from_url("https://en.wikipedia.org/wiki/Article_Title#Section").unwrap();
/// assert_eq!(url, title);
biodranik commented 2023-06-22 21:47:31 +00:00 (Migrated from github.com)
Review

Does it make sense to test it?

newsch commented 2023-06-23 16:08:14 +00:00 (Migrated from github.com)
Review

I added some checks for whitespace, empty strings, and tests for errors in 70f7edf; is there something else you think should be handled?
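For illustration, a minimal sketch of what such error-case tests could look like, assuming the constructors reject empty and whitespace-only titles as described (the actual tests in 70f7edf may differ):

```rust
#[cfg(test)]
mod tests {
    use super::WikipediaTitleNorm;

    // Assumes from_title returns Err for empty/whitespace input,
    // per the checks described above.
    #[test]
    fn empty_and_whitespace_titles_are_errors() {
        assert!(WikipediaTitleNorm::from_title("", "en").is_err());
        assert!(WikipediaTitleNorm::from_title(" \t", "en").is_err());
    }

    // Mirrors the error cases shown in the doctest below.
    #[test]
    fn non_article_urls_are_errors() {
        assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
        assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
    }
}
```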
///
/// assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/not_a_wiki_page").is_err());
/// assert!(WikipediaTitleNorm::from_url("https://wikidata.org/wiki/Q12345").is_err());
/// ```
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be OK to run tasks in parallel using a Bash for loop. We'll rethink it if necessary later.

#[derive(Debug, PartialOrd, Ord, PartialEq, Eq, Hash)]
pub struct WikipediaTitleNorm {
    lang: String,
    name: String,
}
impl WikipediaTitleNorm {
    fn normalize_title(title: &str) -> String {
        // TODO: Compare with map generator url creation, ensure covers all cases.
        title.trim().replace(' ', "_")
    }
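To complement the doctest mentioned in the thread above, here is a sketch of a unit test for this helper; it only covers the normalization itself, since the constructors and their error cases aren't visible in this diff:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn normalize_title_trims_and_replaces_spaces() {
        // Surrounding whitespace is trimmed and inner spaces become underscores.
        assert_eq!(
            WikipediaTitleNorm::normalize_title(" Statue of Liberty "),
            "Statue_of_Liberty"
        );
        // Already-normalized titles pass through unchanged.
        assert_eq!(
            WikipediaTitleNorm::normalize_title("Statue_of_Liberty"),
            "Statue_of_Liberty"
        );
    }
}
```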
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it works with any language, but it only processes a single dump at a time. So when it reads an English dump, each article JSON has an .in_language.identifier field set to en, and the program writes that HTML to a QXXXXX/en.html file.

Running the program multiple times with different language dumps fills in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect much of a speedup from that right now.

  2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but it doesn't check the various error cases.
Do you think there should be more?
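For concreteness, here is a minimal sketch of the per-language layout described above. The helper names (`article_path`, `write_article`) and the `descriptions` base directory are hypothetical, not the crate's actual API:

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

// Each article gets a per-QID directory with one HTML file per dump language,
// e.g. descriptions/Q42/en.html after an English run, then de.html after a German run.
fn article_path(base: &Path, qid: &str, lang: &str) -> PathBuf {
    base.join(qid).join(format!("{lang}.html"))
}

fn write_article(base: &Path, qid: &str, lang: &str, html: &str) -> io::Result<()> {
    let path = article_path(base, qid, lang);
    fs::create_dir_all(path.parent().expect("article path has a parent dir"))?;
    fs::write(path, html)
}
```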

biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?

I haven't tried it out yet, but a couple of options come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar; I'm not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this comment).

  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
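As a rough illustration of the third option, here is a sketch of spawning the decompressor as a subprocess and streaming its stdout. This is not part of the current CLI, and `gunzip -c` is just one possible command:

```rust
use std::io::{self, BufRead, BufReader};
use std::process::{Command, Stdio};

// Spawn `gunzip -c <path>` and expose its stdout as a buffered reader.
// A real version would also keep the Child around to wait() on it.
fn open_dump(path: &str) -> io::Result<impl BufRead> {
    let child = Command::new("gunzip")
        .arg("-c") // decompress to stdout instead of replacing the file
        .arg(path)
        .stdout(Stdio::piped())
        .spawn()?;
    Ok(BufReader::new(child.stdout.expect("stdout was piped")))
}
```

Several of these children could be spawned at once, each feeding the same worker pool.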

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be OK to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
// https://en.wikipedia.org/wiki/Article_Title
pub fn from_url(url: &str) -> anyhow::Result<Self> {
    let url = Url::parse(url.trim())?;
    let (subdomain, host) = url
        .host_str()
.ok_or_else(|| anyhow!("Expected host"))?
        .split_once('.')
        .ok_or_else(|| anyhow!("Expected subdomain"))?;
    if host != "wikipedia.org" {
bail!("Expected wikipedia.org for domain")
}
let lang = subdomain;
let mut paths = url
    .path_segments()
.ok_or_else(|| anyhow!("Expected path"))?;
let root = paths
    .next()
.ok_or_else(|| anyhow!("Expected first segment in path"))?;
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article JSON has an `.in_language.identifier` field set to `en`, and the program writes that HTML to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.
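
For illustration, that routing could look roughly like this. It is a minimal sketch: only the fields needed to pick the output path are shown, and `main_entity` is an assumed field name rather than one confirmed in this PR:

```rust
use std::path::{Path, PathBuf};

use serde::Deserialize;

/// Minimal sketch of the dump's per-article JSON; `main_entity`
/// is an assumption, not a field confirmed in this PR.
#[derive(Deserialize)]
struct Article {
    main_entity: Entity,   // Wikidata item, e.g. "Q42"
    in_language: Language, // e.g. "en" for an English dump
}

#[derive(Deserialize)]
struct Entity {
    identifier: String,
}

#[derive(Deserialize)]
struct Language {
    identifier: String,
}

/// Build the `QXXXXX/<lang>.html` output path for one article.
fn output_path(base: &Path, article: &Article) -> PathBuf {
    base.join(&article.main_entity.identifier)
        .join(format!("{}.html", article.in_language.identifier))
}
```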

  2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?
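
For example, error-case tests might look something like this; `WikipediaTitleNorm::from_url` is a stand-in for whichever constructor the type actually exposes:

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Sketch only: `WikipediaTitleNorm::from_url` stands in for the
    // actual constructor; the URLs below should all be rejected.
    #[test]
    fn rejects_non_wikipedia_domain() {
        assert!(WikipediaTitleNorm::from_url("https://example.com/wiki/Article").is_err());
    }

    #[test]
    fn rejects_missing_title_segment() {
        assert!(WikipediaTitleNorm::from_url("https://en.wikipedia.org/").is_err());
    }
}
```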

biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is fine (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is fine (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python's `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch below).
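
A rough sketch of that last option, assuming the caller supplies the decompression command (e.g. `gunzip -c dump.json.gz`); the worker-pool handoff is elided:

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

use anyhow::Context;

/// Spawn the user-supplied decompression command and stream its
/// stdout line by line into the existing processing loop.
fn stream_dump(cmd: &str, args: &[&str]) -> anyhow::Result<()> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()
        .context("spawning decompressor")?;

    let stdout = child.stdout.take().context("no stdout handle")?;
    for line in BufReader::new(stdout).lines() {
        let line = line?;
        // ... deserialize the article JSON and hand it to the worker pool ...
        let _ = &line;
    }

    let status = child.wait()?;
    anyhow::ensure!(status.success(), "decompressor exited with {status}");
    Ok(())
}
```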
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article JSON has a .in_language.identifier field set to en, and the program writes that HTML to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?
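For reference, after running against an English and then a German dump, the output tree would look something like this (the descriptions directory name and the QID are hypothetical examples, not taken from this PR):

```
descriptions/
└── Q42/
    ├── en.html
    └── de.html
```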
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind (see the sketch after this thread):

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar; not sure about Python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool.

  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
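A minimal sketch of that wrapper approach, combining the parallel for-loop and the serial-decompression options discussed above. The om-wikiparser invocation is a placeholder: an output-directory argument with the decompressed dump on stdin is an assumption for illustration, not the confirmed CLI.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper; the om-wikiparser arguments below are assumptions.
set -euo pipefail

OUTPUT_DIR=descriptions  # assumed directory holding the QXXXXX/$lang.html files

# Parallel: one process per language dump, joined at the end.
for dump in dumps/*.json.tar.gz; do
    tar xzOf "$dump" | om-wikiparser "$OUTPUT_DIR" &
done
wait

# Serial alternative: concatenate all decompressed dumps into one stdin stream.
# for dump in dumps/*.json.tar.gz; do tar xzOf "$dump"; done | om-wikiparser "$OUTPUT_DIR"
```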
if root != "wiki" {
bail!("Expected 'wiki' in path")
}
let title = paths
    .next()
    .ok_or_else(|| anyhow!("Expected second segment in path"))?;
let title = urlencoding::decode(title)?;
Self::from_title(&title, lang)
}
// en:Article Title
fn _from_osm_tag(tag: &str) -> anyhow::Result<Self> {
    let (lang, title) = tag
        .trim()
        .split_once(':')
        .ok_or_else(|| anyhow!("Expected ':'"))?;
    Self::from_title(title, lang)
}
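For reference, a hedged sketch of the tag format this helper accepts: the value of an OSM `wikipedia` tag pairs a language code and an article title around a colon. The examples below are illustrative inputs traced through the `trim`/`split_once` logic above; how `from_title` normalizes the result is assumed from context.

```rust
// Illustrative inputs and the (lang, title) split performed above:
//   "en:Article Title"  -> ("en", "Article Title")
//   " de:Berlin "       -> ("de", "Berlin")      // outer whitespace trimmed first
//   "Article Title"     -> Err("Expected ':'")   // no colon to split on
```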
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

> 1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article's JSON has an `.in_language.identifier` field set to `en`, and the program writes that HTML to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files (see the sketch below).
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

> 2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?
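A minimal sketch of the per-language layout described above, assuming a base output directory; the `article_path` helper and its name are hypothetical, not part of the codebase:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: where the HTML for one article/language pair lands.
/// `qid` is the Wikidata ID (e.g. "Q42"); `lang` comes from the dump's
/// `.in_language.identifier` field (e.g. "en").
fn article_path(base: &Path, qid: &str, lang: &str) -> PathBuf {
    base.join(qid).join(format!("{lang}.html"))
}

// One run per dump fills in sibling files under the same QID directory:
//   article_path(base, "Q42", "en") -> base/Q42/en.html
//   article_path(base, "Q42", "de") -> base/Q42/de.html
```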
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
> 1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but it should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`; I'm not sure about Python's `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocesses directly; it could do this in parallel and pass the results to the same worker pool (sketched below).

> 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
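A rough sketch of that third option, assuming the decompression command is passed in by the caller; `stream_dump` and its interface are illustrative, not the actual CLI:

```rust
use std::io::BufRead;
use std::process::{Command, Stdio};

/// Illustrative only: spawn a user-supplied decompression command for one
/// dump and stream its stdout line by line (one JSON article per line),
/// which could then be handed to the existing worker pool.
fn stream_dump(cmd: &str, args: &[&str]) -> anyhow::Result<()> {
    let mut child = Command::new(cmd)
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;
    let stdout = child.stdout.take().expect("stdout was piped");
    for line in std::io::BufReader::new(stdout).lines() {
        let _article_json = line?; // hand off to the worker pool here
    }
    child.wait()?;
    Ok(())
}
```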
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an English dump, each article JSON has an `.in_language.identifier` field set to `en`, and the program writes that HTML to a `QXXXXX/en.html` file.

Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  2. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?
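To make that concrete, here is a minimal sketch of the per-language output step, assuming a serde view of the dump's JSON; `in_language.identifier` is the field described above, while the struct names, the `qid` parameter, and `output_path` itself are illustrative stand-ins, not the actual implementation:

```rust
use serde::Deserialize;
use std::path::{Path, PathBuf};

// Sketch only: struct names are stand-ins; `in_language.identifier`
// mirrors the dump field described in the comment above.
#[derive(Deserialize)]
struct Language {
    identifier: String, // e.g. "en"
}

#[derive(Deserialize)]
struct Article {
    in_language: Language,
}

/// Build the `QXXXXX/$lang.html` path under `base` for one parsed article.
fn output_path(base: &Path, qid: &str, article: &Article) -> PathBuf {
    base.join(qid)
        .join(format!("{}.html", article.in_language.identifier))
}
```

Each run only touches files for the language of the dump being read, which is why repeated runs over different dumps fill in the remaining `QXXXXX/$lang.html` files.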
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is OK (and can be parallelized in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about Python `pgzip`.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly; it could do this in parallel and pass the results to the same worker pool (see the sketch after this comment).

  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of URLs/IDs from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
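For the third option, a rough Rust shape could look like the following; the `gunzip -c` command and archive name are placeholders, and a real version would feed each JSON line to the existing worker pool instead of discarding it:

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, Stdio};

fn main() -> std::io::Result<()> {
    // Placeholder command and file name for illustration.
    let mut child = Command::new("gunzip")
        .args(["-c", "enwiki-NS0-dump.ndjson.gz"])
        .stdout(Stdio::piped())
        .spawn()?;
    let reader = BufReader::new(child.stdout.take().expect("stdout was piped"));
    for line in reader.lines() {
        let _json_line = line?; // hand off to the worker pool here
    }
    child.wait()?;
    Ok(())
}
```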
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be OK to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
pub fn from_title(title: &str, lang: &str) -> anyhow::Result<Self> {
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
let title = title.trim();
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
let lang = lang.trim();
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
if title.is_empty() {
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
bail!("title cannot be empty or whitespace");
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
}
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
if lang.is_empty() {
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
bail!("lang cannot be empty or whitespace");
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
}
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.

It should be ok to run tasks in parallel using bash for loop. We'll rethink it if necessary later.
let name = Self::normalize_title(title);
biodranik commented 2023-06-22 21:45:20 +00:00 (Migrated from github.com)
Review
  1. Is it in English only now?
  2. Does it make sense to test this function?
1. Is it in English only now? 2. Does it make sense to test this function?
newsch commented 2023-06-22 23:33:17 +00:00 (Migrated from github.com)
Review

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it.

  1. Is it in English only now?

Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a .in_language.identifier field set to en, and the program writes that html to a QXXXXX/en.html file.

Running the program multiple times with different language dumps will fill in the various QXXXXX/$lang.html files.
We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now.

  1. Does it make sense to test this function?

The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases.
Do you think there should be more?

Oops, that comment slipped through when I handled multiple languages in 6e5385d. I'll remove it. > 1. Is it in English only now? Currently it will work with any language, but it only processes a single dump at a time. So when it reads an english dump, each article json has a `.in_language.identifier` field set to `en`, and the program writes that html to a `QXXXXX/en.html` file. Running the program multiple times with different language dumps will fill in the various `QXXXXX/$lang.html` files. We could extend it to process multiple dumps in parallel, but I don't expect there to be much of a speedup right now. > 2. Does it make sense to test this function? The doctest on the type definition verifies that the two constructors parse and normalize correctly, but doesn't check for various error cases. Do you think there should be more?
biodranik commented 2023-06-23 06:08:13 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?
  2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes.
newsch commented 2023-06-23 14:41:22 +00:00 (Migrated from github.com)
Review
  1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach?

I haven't tried it out yet, but there are a couple of options that come to mind:

  • Decompress the archives serially so they're all concatenated together into stdin. This looks possible with gunzip/tar, not sure about python pgzip.
  • As you say, run the program repeatedly through a wrapper script, using a for loop, xargs, parallel, etc.
  • Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool.
  1. It may make sense to check values that are in OSM. Users can make a lot of mistakes.

Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.

> 1. Does it imply a wrapping script that launches the app for each language/file? That is ok (and can be paralleled in bash by launching several processes simultaneously), but should be documented. Or is there a better approach? I haven't tried it out yet, but there are a couple of options that come to mind: - Decompress the archives serially so they're all concatenated together into stdin. This looks possible with `gunzip`/`tar`, not sure about python `pgzip`. - As you say, run the program repeatedly through a wrapper script, using a for loop, `xargs`, `parallel`, etc. - Pass the decompression command to the program and have it spawn the subprocess directly, it could do this in parallel and pass the results to the same worker pool. > 2. It may make sense to check values that are in OSM. Users can make a lot of mistakes. Understood. I have been using the world list of urls/ids from the map generator with no problems, but if we switch to using OSM data directly I'll rethink this. The program will log any issues it has parsing titles/QIDs.
biodranik commented 2023-06-23 17:16:40 +00:00 (Migrated from github.com)
Review

It should be ok to run tasks in parallel using a bash for loop. We'll rethink it if necessary later.
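On the OSM-value checking point: a rough sketch of parse-and-log validation, assuming a hypothetical `Qid` type — the crate's real constructors and normalization rules may differ:

```rust
use std::str::FromStr;

use log::warn;

/// Hypothetical normalized Wikidata item ID, e.g. "Q42".
#[derive(Debug, PartialEq, Eq)]
struct Qid(u64);

impl FromStr for Qid {
    type Err = std::num::ParseIntError;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        // Tolerate "Q42", "q42", or a bare "42" from hand-edited OSM tags.
        let digits = s.trim().trim_start_matches(|c: char| c == 'Q' || c == 'q');
        digits.parse().map(Qid)
    }
}

/// Parse OSM-provided values, logging malformed entries instead of
/// aborting the whole run.
fn parse_qids(raw: &[&str]) -> Vec<Qid> {
    raw.iter()
        .filter_map(|s| match s.parse::<Qid>() {
            Ok(qid) => Some(qid),
            Err(e) => {
                warn!("skipping malformed QID {s:?}: {e}");
                None
            }
        })
        .collect()
}
```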
let lang = lang.to_owned();
Ok(Self { name, lang })
}
}

45
src/wm/page.rs Normal file
View file

@ -0,0 +1,45 @@
use serde::Deserialize;

// TODO: consolidate into single struct
/// Deserialized Wikimedia Enterprise API Article
///
/// For all available fields, see <https://enterprise.wikimedia.com/docs/data-dictionary/>.
#[allow(dead_code)] // TODO: reevaluate fields
#[derive(Deserialize)]
pub struct Page {
    // TODO: Check if CoW has a performance impact.
    pub name: String,
    pub date_modified: String,
    pub in_language: Language,
    #[serde(default)]
    pub url: String,
    pub main_entity: Option<Wikidata>,
    // TODO: See what impact parsing/unescaping/allocating this has.
    pub article_body: ArticleBody,
    #[serde(default)]
    pub redirects: Vec<Redirect>,
}

#[derive(Deserialize)]
pub struct Wikidata {
    pub identifier: String,
}

#[derive(Deserialize)]
pub struct ArticleBody {
    // TODO: Look into RawValue to lazily parse/allocate this:
    // https://docs.rs/serde_json/latest/serde_json/value/struct.RawValue.html
    pub html: String,
}

#[allow(dead_code)] // TODO: Reevaluate fields.
#[derive(Deserialize)]
pub struct Redirect {
    pub url: String,
    pub name: String,
}

#[derive(Deserialize)]
pub struct Language {
    pub identifier: String,
}
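The dump is line-delimited JSON, so each line deserializes directly into `Page`; a minimal usage sketch (the `read_pages` helper is illustrative, not part of this file):

```rust
use std::io::BufRead;

/// Read one JSON article per line and print pages that carry a
/// Wikidata QID. Illustrative only; not the crate's actual driver.
fn read_pages(reader: impl BufRead) -> anyhow::Result<()> {
    for line in reader.lines() {
        let page: Page = serde_json::from_str(&line?)?;
        if let Some(wd) = &page.main_entity {
            println!("{} [{}] -> {}", page.name, page.in_language.identifier, wd.identifier);
        }
    }
    Ok(())
}
```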