Add option to dump new QIDs #20
Closes #15
This implements the simplest method I discussed: saving the QIDs of articles that are matched only by title to a file, so we can run the extractor with it again.
Remaining work:
Churning through the English dump, there are 83 extracted pages without a QID, and 9,528 articles matched by title with an unknown QID that would need to be resolved with another pass over the other dumps.
Attachments: log.txt, new_qids.txt
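For context, a minimal sketch of that mechanism (this is not the actual patch; the function name and file handling are illustrative assumptions):

```rust
use std::collections::HashSet;
use std::fs::OpenOptions;
use std::io::Write;

/// Sketch: while extracting, remember QIDs that were only matched by
/// title and are missing from the set passed via --wikidata-ids, so
/// they can be fed back into a later run.
fn record_new_qid(
    known_qids: &HashSet<String>,
    qid: &str,
    new_qids_path: &str,
) -> std::io::Result<()> {
    if !known_qids.contains(qid) {
        let mut file = OpenOptions::new()
            .create(true)
            .append(true)
            .open(new_qids_path)?;
        // One QID per line, matching the --wikidata-ids input format;
        // format the whole line first so it goes out in a single write.
        file.write_all(format!("{qid}\n").as_bytes())?;
    }
    Ok(())
}
```

Formatting the whole line before writing matches the single-write approach discussed further down in this thread.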
I wrongly assumed that WIKIDATA_IDS should contain "one per line...".
Can it be reworded? E.g. "Path to a file that contains one Wikidata QID per line (e.g. Q12345)"

ditto
Why is this file needed? What are the pros of using this option and cons of not using it? Can it be explained?
@ -22,0 +31,4 @@
<OUTPUT_DIR>
Directory to write the extracted articles to
Options:
Options imply that they are... optional. Will it work without options?
Yes, but it won't extract any articles. Providing one or both of --wikidata-ids and --wikipedia-urls will extract matching articles.
My plan is that we can write the previously unknown QIDs to a file, and then run the program again by passing the same file to --wikidata-ids.
So we first run the extractor with the generator-provided urls and qids on all of the language dumps, and have them write the "new" qids to a file.
Then we run the extractor again on all the language dumps, but only with the new qid file.
Because of the problem covered in #15, each language's dump needs the "new" qids from every other language's dump.
So for the script I'm working on, it looks like:
I'll move the two filter options to a new heading and return an error if neither is present.
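Assuming the CLI is built with clap 4's derive API (the struct, doc comments, and heading name below are illustrative, not the actual wikiparser definitions), that could look something like:

```rust
use std::path::PathBuf;

use clap::{ArgGroup, Parser};

#[derive(Parser)]
// Require at least one of the two filter options; clap errors out if
// neither --wikidata-ids nor --wikipedia-urls is provided.
#[command(group(ArgGroup::new("filters").required(true).multiple(true)))]
struct Args {
    /// Directory to write the extracted articles to
    output_dir: PathBuf,

    /// Path to a file that contains one Wikidata QID per line (e.g. Q12345)
    #[arg(long, group = "filters", help_heading = "FILTERS")]
    wikidata_ids: Option<PathBuf>,

    /// Path to a file that contains one Wikipedia article URL per line
    #[arg(long, group = "filters", help_heading = "FILTERS")]
    wikipedia_urls: Option<PathBuf>,
}

fn main() {
    let _args = Args::parse();
    // ... extraction would go here
}
```

With required(true) on the group, clap rejects an invocation that passes neither filter, and help_heading groups the two flags under their own section of the --help output.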
@newsch if you write QIDs from different language dumps to different files, then they can be run/extracted in parallel, and loaded/merged later from a bunch of files.
That's true, but I think that opening in append mode and writing the entire line atomically means it's already thread-safe.
No, it is not thread-safe. Why do you think it is?
Let me preface this by saying that I'm happy to write them to separate files.
My understanding is that under POSIX, each write syscall to a file is atomic (below a certain size), and that append mode atomically moves to the end of the stream before each write.
There is some discussion on Stack Overflow, and the Rust docs for append mention it.
From the write(3posix) manpage:
Manpage excerpts
I set up an example in Rust for this. Calling writeln! with something that implements Display causes multiple write syscalls logged in strace, and interleaves within each line (the writeln_fmt function below). Using format! and then writing that String with write! (the write_string_buffer function below) leads to a single write syscall and no interleaving within lines.
Output
Rust multithreaded example code
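The collapsed example code isn't reproduced above; a self-contained sketch of that kind of experiment, reconstructed from the function names in the comment (an assumption, not the author's actual code), might look like:

```rust
use std::fmt;
use std::fs::OpenOptions;
use std::io::Write;
use std::thread;

// A Display impl that formats in several pieces, so writeln! may issue
// more than one write syscall per line.
struct Record(u64);

impl fmt::Display for Record {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "thread={} ", self.0)?;
        write!(f, "value={}", self.0 * 2)
    }
}

// Each writeln! call can translate into multiple write syscalls on the
// unbuffered File, so lines from different threads may interleave mid-line.
fn writeln_fmt(path: &str, id: u64) {
    let mut file = OpenOptions::new().append(true).open(path).unwrap();
    for _ in 0..1000 {
        writeln!(file, "{}", Record(id)).unwrap();
    }
}

// Format the whole line into a String first, then write it with a single
// call, so each short line goes out in one write syscall.
fn write_string_buffer(path: &str, id: u64) {
    let mut file = OpenOptions::new().append(true).open(path).unwrap();
    for _ in 0..1000 {
        let line = format!("{}\n", Record(id));
        file.write_all(line.as_bytes()).unwrap();
    }
}

fn main() {
    // Run both strategies, each with several threads appending to one file.
    for (path, write_lines) in [
        ("writeln_fmt.txt", writeln_fmt as fn(&str, u64)),
        ("write_string_buffer.txt", write_string_buffer as fn(&str, u64)),
    ] {
        std::fs::File::create(path).unwrap();
        let handles: Vec<_> = (0..4u64)
            .map(|id| thread::spawn(move || write_lines(path, id)))
            .collect();
        for handle in handles {
            handle.join().unwrap();
        }
    }
}
```

Running it under strace -f -e trace=write and checking the output files for lines with mixed thread= values is one way to observe the interleaving difference between the two functions.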
So it looks like it works, but the current wikiparser implementation calls writeln! directly, so if we stick with this it should be updated to write to a String first.

Ok, I added those changes in 65d97a59db