Do not panic on missing .html #50

Open
petrzjunior wants to merge 1 commit from petrzjunior/missing-html into main
petrzjunior commented 2025-01-18 19:04:18 +00:00 (Migrated from github.com)

Fixes #49

This is an attempt to fix articles with missing article_body.html fields. I downloaded the Wikimedia Enterprise export for several languages and noticed that some articles are indeed missing the HTML field.

For example, this article has only wikitext and no html:

{
  "name": "Linka B (metro v Praze)",
  "identifier": 92076,
  "date_modified": "2024-10-09T08:01:21Z",
  "url": "https://cs.wikipedia.org/wiki/Linka_B_(metro_v_Praze)",
  "in_language": {
    "identifier": "cs"
  },
  "main_entity": {
    "identifier": "Q1460442",
    "url": "https://www.wikidata.org/entity/Q1460442"
  },
  "is_part_of": {
    "identifier": "cswiki",
    "url": "https://cs.wikipedia.org"
  },
  "article_body": {
    "wikitext": "{{Různé významy|tento = lince pražského metra|stránka = B (linka)}}\n..."
  },
  ...
}

According to the Wikimedia Enterprise docs (https://enterprise.wikimedia.com/docs/data-dictionary/#article_body), this field is not required.

In this PR, I made the field optional, and a warning is now printed when such an article is parsed. Previously, parsing it crashed the process.
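
As a rough illustration of the idea (not the actual diff), the change amounts to modelling the field as an `Option` and skipping the record with a warning instead of unwrapping it. The type and function names below (`Page`, `ArticleBody`, `html_or_warn`, `pages_with_html`) are hypothetical and simplified for this sketch. In the real parser the struct would be filled by a JSON deserializer rather than built by hand.

```rust
/// Hypothetical, simplified model of one record from the export.
#[derive(Debug, Clone)]
pub struct ArticleBody {
    /// Optional: some records carry only `wikitext`.
    pub html: Option<String>,
    pub wikitext: Option<String>,
}

#[derive(Debug, Clone)]
pub struct Page {
    pub name: String,
    pub article_body: ArticleBody,
}

/// Returns the page's HTML, or `None` after logging a warning when it is absent.
/// An `unwrap()` here would panic on records like the one shown above.
pub fn html_or_warn(page: &Page) -> Option<&str> {
    let html = page.article_body.html.as_deref();
    if html.is_none() {
        eprintln!(
            "warning: page {:?} has no article_body.html, skipping",
            page.name
        );
    }
    html
}

/// Keeps only the pages that have HTML, as (name, html) pairs.
/// Pages without HTML are dropped from the result.
pub fn pages_with_html(pages: &[Page]) -> Vec<(&str, &str)> {
    pages
        .iter()
        .filter_map(|p| html_or_warn(p).map(|html| (p.name.as_str(), html)))
        .collect()
}
```

With this shape, a record that has only `wikitext` produces one warning on stderr and is left out of the result, while the remaining records are processed normally.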

cc @rtsisyk

Member

Thanks for the patch! It seems to work!

> In the PR, I made the field optional and print a warning in case such article is parsed.

So what does ultimately happen to such articles? Are they ignored/skipped?

Member

The fix works well @rtsisyk, all data has been processed successfully.

petrzjunior commented 2025-03-10 00:20:58 +00:00 (Migrated from github.com)

> So what does ultimately happen to such articles? Are they ignored/skipped?

They are not present in the final dump.

pastk approved these changes 2025-03-10 03:54:04 +00:00
Member

@rtsisyk merge?
