Use wikidata to complete seeds #50

sebastian-nagel · 2022-10-18T13:30:54Z

Initially, the news crawler was seeded with URLs of news sites listed in DMOZ (see #8 for the procedure). DMOZ is no longer updated, but Wikidata could serve as a replacement source to complete the seed list:

Select all instances of newspaper (or news media, or similar) that have an official website:

SELECT DISTINCT ?item ?itemLabel ?lang ?url
WHERE
{ 
  ?item wdt:P31/wdt:P279* wd:Q11032.
  ?item wdt:P856 ?url.  # with official website
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,ru,fr,es,it,ja,zh,*" }
  OPTIONAL {
     ?item wdt:P407 ?language.
     ?language wdt:P220 ?lang.
   }
}
LIMIT 50

(execute query on Wikidata query service)
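For running the query outside the web UI, here is a minimal Python sketch. It assumes the public SPARQL endpoint at `https://query.wikidata.org/sparql` and the standard SPARQL JSON results layout; the function names and the User-Agent string are placeholders for illustration, not part of the crawler.

```python
import json
import urllib.parse
import urllib.request

# Public Wikidata Query Service endpoint (assumed here; adjust if you use a mirror).
ENDPOINT = "https://query.wikidata.org/sparql"


def build_request(query, user_agent="news-seed-sketch/0.1 (placeholder contact)"):
    """Build a GET request asking the endpoint for JSON results."""
    params = urllib.parse.urlencode({"query": query, "format": "json"})
    headers = {
        "User-Agent": user_agent,  # placeholder; identify your own client here
        "Accept": "application/sparql-results+json",
    }
    return urllib.request.Request(f"{ENDPOINT}?{params}", headers=headers)


def parse_bindings(payload):
    """Flatten SPARQL JSON results into one plain dict per result row.

    Variables left unbound by an OPTIONAL clause are simply absent from the row.
    """
    rows = []
    for binding in payload["results"]["bindings"]:
        rows.append({name: cell["value"] for name, cell in binding.items()})
    return rows


def run_query(query, timeout=60):
    """Send the query and return the parsed rows (requires network access)."""
    with urllib.request.urlopen(build_request(query), timeout=timeout) as resp:
        return parse_bindings(json.load(resp))
```

Calling `run_query` with the query text above would return rows keyed by `item`, `itemLabel`, `url` and, where present, `lang`.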

tfmorris · 2023-11-16T19:53:12Z

Wikidata-based seed URLs will probably require significant deduplication, filtering, reranking, etc., but here's a version of the query that adds the language of each URL to account for sites with different base URLs for different languages, like Blick. It also expands the language list (because * doesn't work), though it could be generalized further. As an example of the type of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.

SELECT DISTINCT ?item ?itemLabel ?lang ?worklang ?url WHERE {
  ?item (wdt:P31/(wdt:P279*)) wd:Q11032;
    p:P856 ?statement.
  ?statement ps:P856 ?url.
  OPTIONAL {
    ?statement pq:P407 ?worklanguage.
    ?worklanguage wdt:P220 ?worklang.
  }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,uk,ru,fr,es,it,ja,zh,ar,hu,pt,be,rus,ce,br,cs,sv,dk,da,he,fi,nb,id,eu,pl,nl,az,mar,lv,hr,am,ba,r". }
}
LIMIT 100
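As a rough illustration of the deduplication step, the sketch below groups result rows by item and drops URLs that collapse to the same key after a simple normalization. The normalization rules (ignore the scheme, strip a leading `www.`, drop trailing slashes) and the row layout are assumptions made for this example; dead links such as the 404 mentioned above would still need a fetch-based check.

```python
from collections import defaultdict
from typing import Optional
from urllib.parse import urlsplit


def normalize(url: str) -> Optional[str]:
    """Return a comparison key for a URL, or None if it is not http(s)."""
    parts = urlsplit(url.strip())
    if parts.scheme not in ("http", "https") or not parts.hostname:
        return None
    host = parts.hostname.lower()
    if host.startswith("www."):
        host = host[4:]
    # Scheme and trailing slashes are ignored so trivial variants collapse.
    return host + parts.path.rstrip("/")


def group_seeds(rows):
    """Group rows by item, keeping the first URL seen for each normalized key."""
    seen = set()
    grouped = defaultdict(list)
    for row in rows:
        key = normalize(row.get("url", ""))
        if key is None or key in seen:
            continue
        seen.add(key)
        # Prefer the language attached to the URL statement over the item's language.
        lang = row.get("worklang") or row.get("lang")
        grouped[row["item"]].append({"url": row["url"], "lang": lang})
    return dict(grouped)
```

Keeping language-specific paths such as `/fr/` as separate keys preserves the per-language base URLs that the query is meant to surface.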

Query
As of today, there are 11,177 results. More than 200 languages are represented, plus a couple of thousand sites with no language tag, and the distribution looks about as you'd expect (the two-letter codes represent TLDs, not language codes, e.g. hk, ru, uk, de, au, cn):

eng	3562
fra	826
spa	586
rus	467
deu	316
ita	177
ara	168
ukr	166
fin	152
zho	146
jpn	145
swe	140
nor	122
hk	112
ru	112
por	108
hun	103
nld	93
uk	90
de	86
kor	86
au	78
cn	78
pol	66
hin	60
bel	59
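A tally like the one above can be reproduced from exported result rows with a few lines of Python. This is only a sketch: how the original counts were produced is not stated, and the `lang` field name and the `(none)` placeholder are assumptions.

```python
from collections import Counter


def tally_codes(rows, field="lang"):
    """Count rows per code; rows without the field are counted under '(none)'."""
    return Counter(row.get(field) or "(none)" for row in rows)


def split_by_length(counts):
    """Separate three-letter codes from everything else (two-letter codes, '(none)')."""
    three = {code: n for code, n in counts.items() if len(code) == 3}
    other = {code: n for code, n in counts.items() if len(code) != 3}
    return three, other
```

`tally_codes(rows).most_common(26)` would list the most frequent codes in descending order, and `split_by_length` makes it easier to inspect the two-letter entries separately.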

tfmorris · 2024-12-10T01:06:10Z

The query above, from a year ago, now returns 156K URLs vs. 11K before, but a more specific query for just news websites (Q17232649) returns a very tractable 3553 URLs for ~2500 unique entities covering 90+ languages. 571 of the sites don't include any language info, but that should be pretty easy to figure out just by fetching the home page.
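One possible signal for the missing language info is the `lang` attribute declared on a home page's `<html>` element. The sketch below extracts it from already-fetched HTML using only the standard library; fetching is left out, the function name is hypothetical, and since many pages omit or misdeclare the attribute it should be treated as a hint rather than ground truth.

```python
from html.parser import HTMLParser
from typing import Optional


class _LangFinder(HTMLParser):
    """Record the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()


def declared_language(html_text: str) -> Optional[str]:
    """Return the declared page language (e.g. 'de-ch'), or None if absent."""
    finder = _LangFinder()
    finder.feed(html_text)
    return finder.lang
```

The returned value is a language tag such as `de-ch`, so it would still need mapping to whatever code scheme the seed list uses.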
