Add option to dump new QIDs #20

Merged
newsch merged 5 commits from dump-new-qids into main 2023-07-13 18:04:52 +00:00
newsch commented 2023-07-11 16:23:50 +00:00 (Migrated from github.com)

Closes #15

This implements the simplest method I discussed, saving QIDs for articles that are matched only by title to a file we can run the extractor with again.

Remaining work:

  • wait for complete run to finish and check results
newsch commented 2023-07-11 20:41:26 +00:00 (Migrated from github.com)

Churning through the English dump, there are 83 extracted pages without a QID, and 9528 articles matched by title with an unknown QID that would need to be resolved with another pass over the other dumps.

[log.txt](https://github.com/organicmaps/wikiparser/files/12021356/log.txt)
[new_qids.txt](https://github.com/organicmaps/wikiparser/files/12021350/new_qids.txt)
biodranik (Migrated from github.com) approved these changes 2023-07-13 00:34:59 +00:00
biodranik (Migrated from github.com) commented 2023-07-13 00:28:44 +00:00

I wrongly assumed that WIKIDATA_IDS should contain "one per line...".
Can it be reworded? E.g. "Path to a file that contains one Wikidata QID per line (e.g. `Q12345`)"
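For illustration, a "one QID per line" input file could be built like this (the file name and QIDs here are made up, not from the actual data):

```shell
# Hypothetical example of a --wikidata-ids input file: one QID per line.
cat > wikidata_ids.txt <<'EOF'
Q12345
Q90
Q64
EOF
# Each line is exactly one Wikidata item identifier.
wc -l < wikidata_ids.txt
```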
biodranik (Migrated from github.com) commented 2023-07-13 00:28:51 +00:00

ditto

biodranik (Migrated from github.com) commented 2023-07-13 00:30:07 +00:00

Why is this file needed? What are the pros of using this option and cons of not using it? Can it be explained?

@ -22,0 +31,4 @@
<OUTPUT_DIR>
Directory to write the extracted articles to
Options:
biodranik (Migrated from github.com) commented 2023-07-13 00:26:14 +00:00

Options imply that they are... optional. Will it work without options?

newsch (Migrated from github.com) reviewed 2023-07-13 14:59:29 +00:00
@ -22,0 +31,4 @@
<OUTPUT_DIR>
Directory to write the extracted articles to
Options:
newsch (Migrated from github.com) commented 2023-07-13 14:59:28 +00:00

Yes, but it won't extract any articles. Providing one or both of `--wikidata-ids` and `--wikipedia-urls` will extract matching articles.
newsch (Migrated from github.com) reviewed 2023-07-13 15:11:09 +00:00
newsch (Migrated from github.com) commented 2023-07-13 15:11:08 +00:00

My plan is that we can write the previously unknown QIDs to a file, and then run the program again by passing the same file to `--wikidata-ids`.

So we first run the extractor with the generator-provided URLs and QIDs on all of the language dumps, and have it write the "new" QIDs to a file.

Then we run the extractor again on all the language dumps, but only with the new qid file.

Because of the problem covered in #15, each language's dump needs the "new" qids from every other language's dump.

So for the script I'm working on, it looks like:

```
for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  # -O extracts the archive contents to stdout so they can be piped
  tar xzOf "$dump" | om-wikiparser \
    --wikidata-ids wikidata_ids.txt \
    --wikipedia-urls wikipedia_urls.txt \
    --write-new-qids new_qids.txt \
    descriptions/
done

for dump in $DUMP_DOWNLOAD_DIR/*-ENTERPRISE-HTML.json.tar.gz
do
  tar xzOf "$dump" | om-wikiparser \
    --wikidata-ids new_qids.txt \
    descriptions/
done
```
newsch (Migrated from github.com) reviewed 2023-07-13 15:15:57 +00:00
@ -22,0 +31,4 @@
<OUTPUT_DIR>
Directory to write the extracted articles to
Options:
newsch (Migrated from github.com) commented 2023-07-13 15:15:57 +00:00

I'll move the two filter options to a [new heading](https://docs.rs/clap/latest/clap/struct.Arg.html#method.help_heading) and return an error if neither is present.
biodranik (Migrated from github.com) reviewed 2023-07-13 18:16:18 +00:00
biodranik (Migrated from github.com) commented 2023-07-13 18:16:18 +00:00

@newsch if you write QIDs from different language dumps to different files, then they can be run/extracted in parallel, and loaded/merged later from a bunch of files.

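A minimal sketch of that per-file variant (the file names are invented, and `printf` stands in for the parallel extractor runs):

```shell
# Each language dump appends its "new" QIDs to its own file, so runs can
# proceed in parallel without sharing a writer.
printf 'Q1\nQ2\n' > new_qids.en.txt   # stand-in for the en dump's run
printf 'Q2\nQ3\n' > new_qids.de.txt   # stand-in for the de dump's run
# Afterwards, merge and deduplicate into the single file the second pass reads:
sort -u new_qids.*.txt > merged_qids.txt
cat merged_qids.txt
```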
newsch (Migrated from github.com) reviewed 2023-07-13 21:16:52 +00:00
newsch (Migrated from github.com) commented 2023-07-13 21:16:51 +00:00

That's true, but I think that opening in append mode and writing the entire line atomically means it's already thread-safe.

biodranik (Migrated from github.com) reviewed 2023-07-14 05:17:18 +00:00
biodranik (Migrated from github.com) commented 2023-07-14 05:17:18 +00:00

No, it is not thread-safe. Why do you think it is?

newsch (Migrated from github.com) reviewed 2023-07-14 14:42:23 +00:00
newsch (Migrated from github.com) commented 2023-07-14 14:42:23 +00:00

Let me preface this by saying that I'm happy to write them to separate files.

My understanding is that under POSIX, each write syscall to a file is atomic (under a specific size), and that append mode atomically moves to the end of the stream before each write.

There is [some discussion on stack overflow](https://stackoverflow.com/questions/1154446/is-file-append-atomic-in-unix), and [the rust docs for append](https://doc.rust-lang.org/std/fs/struct.OpenOptions.html#method.append) mention it:

> For most filesystems, the operating system guarantees that all writes are atomic: no writes get mangled because another process writes at the same time.
>
> One maybe obvious note when using append-mode: make sure that all data that belongs together is written to the file in one operation. This can be done by concatenating strings before passing them to write(), or using a buffered writer (with a buffer of adequate size), and calling flush() when the message is complete.

From [the `3posix write` manpage](https://www.man7.org/linux/man-pages/man3/write.3p.html):

Manpage excerpts

> If the `O_APPEND` flag of the file status flags is set, the file offset shall be set to the end of the file prior to each write and no intervening file modification operation shall occur between changing the file offset and the write operation.
>
> This volume of POSIX.1‐2017 does not specify the behavior of concurrent writes to a regular file from multiple threads, except that each write is atomic (see [Section 2.9.7, Thread Interactions with Regular File Operations](https://pubs.opengroup.org/onlinepubs/9699919799/functions/V2_chap02.html#tag_15_09_07)). Applications should use some form of concurrency control.
>
> *Atomic/non-atomic*: A write is atomic if the whole amount written in one operation is not interleaved with data from any other process. This is useful when there are multiple writers sending data to a single reader. Applications need to know how large a write request can be expected to be performed atomically. This maximum is called `PIPE_BUF`. This volume of POSIX.1‐2017 does not say whether write requests for more than `PIPE_BUF` bytes are atomic, but requires that writes of `PIPE_BUF` or fewer bytes shall be atomic.

I set up an example in Rust for this.

  • Using just `writeln!` with something that implements `Display` causes multiple `write` syscalls logged in `strace`, and interleaves within each line (the `writeln_fmt` function below)
  • Using `format!` and then writing that `String` with `write!` (the `write_string_buffer` function below) leads to a single `write` syscall and no interleaving within lines.
Output
```
> head writeln_fmt
BCA00000000000000000000000000000000000000000000000
0

BC0A0000000000000000000000000000000000000000000101
1

BCA00000000000000000000000000000000000000000000202
2

BC0A00000000000000000000X00000000000000000000000000003000
```

```
$ head write_string_buffer
B0000000000000000
A0000000000000000
C0000000000000000
B0000000000000001
A0000000000000001
C0000000000000001
B0000000000000002
A0000000000000002
C0000000000000002
B0000000000000003
```
Rust multithreaded example code
```rust
//! Test appending lines to a file from multiple threads.
//!
//! Usage:
//!     cargo-eval test-append-write.rs FILE_PATH
use std::fmt::Display;
use std::thread;
use std::env::args;
use std::fs::File;
use std::io::Write;

struct ThreadNumber(char, usize);

impl Display for ThreadNumber {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> Result<(), std::fmt::Error> {
        write!(f, "{}{:016}", self.0, self.1)
    }
}

// Interleaves anywhere
fn writeln_fmt(f: &mut impl Write, d: &impl Display) {
    writeln!(f, "{}", d).unwrap()
}

// Interleaves d and newline
fn writeln_string_buffer(f: &mut impl Write, d: &impl Display) {
    let buf = format!("{}", d);
    writeln!(f, "{}", buf).unwrap()
}

// Interleaves nowhere
fn write_string_buffer(f: &mut impl Write, d: &impl Display) {
    let buf = format!("{}\n", d);
    write!(f, "{}", buf).unwrap()
}

fn main() {
    let file = args().nth(1).unwrap();

    let handles: Vec<_> = ('A'..='Z').map(|label| {
        let file = file.clone();
        thread::spawn(move || for num in 0..1000 {
            let mut f = File::options().create(true).append(true).open(&file).unwrap();
            let msg = ThreadNumber(label, num);

            // writeln_fmt(&mut f, &msg);
            // writeln_string_buffer(&mut f, &msg);
            write_string_buffer(&mut f, &msg);
        })
    }).collect();

    for handle in handles {
        // propagate panics from the worker threads
        handle.join().unwrap();
    }
}
```

So it looks like it works, but the current wikiparser implementation calls `writeln!` directly, so if we stick with this it should be updated to write to a `String` first.

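The change described above (buffer the full line, newline included, then issue a single write) could be sketched like this; `write_line` is a hypothetical helper, not the actual wikiparser code:

```rust
use std::fmt::Display;
use std::io::Write;

// Hypothetical helper: format the whole record, newline included, into one
// String so the writer sees a single write of the complete line, instead of
// the several writes that `writeln!(f, "{}", qid)` can produce.
fn write_line(f: &mut impl Write, qid: &impl Display) -> std::io::Result<()> {
    let line = format!("{}\n", qid);
    f.write_all(line.as_bytes())
}

fn main() -> std::io::Result<()> {
    // Demonstrate with an in-memory writer; a real run would use the
    // append-mode File discussed above.
    let mut out: Vec<u8> = Vec::new();
    write_line(&mut out, &"Q12345")?;
    assert_eq!(out, b"Q12345\n");
    Ok(())
}
```

Note that the POSIX atomicity guarantee still only applies while each line stays under `PIPE_BUF`, per the manpage excerpts quoted above.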
biodranik (Migrated from github.com) reviewed 2023-07-15 00:59:35 +00:00
biodranik (Migrated from github.com) commented 2023-07-15 00:59:35 +00:00
  1. Atomic appending works only for a file opened in the append mode.
  2. Atomic write is limited by the maximum `PIPE_BUF` size (4 KB on Linux).
  3. As you mentioned, it requires one write syscall to avoid a mess.
  4. It can be used if the order of written data is not important.
  5. Using this approach is OK if it simplifies the implementation, and it is properly implemented/documented, with all restrictions in mind.
newsch (Migrated from github.com) reviewed 2023-07-17 19:36:31 +00:00
newsch (Migrated from github.com) commented 2023-07-17 19:36:31 +00:00

Ok, I added those changes in [65d97a59db](https://github.com/organicmaps/wikiparser/pull/21/commits/65d97a59dba4cd2314aa99e2efb2385c9e562006)