Use wikidata to complete seeds #50
Comments
Wikidata-based seed URLs will probably require some significant deduplication, filtering, reranking, etc., but here's a version of the query which adds the language of the URL to account for sites which have different base URLs for different languages, like Blick. It also expands the language list (because * doesn't work), but it could be generalized more. As an example of the type of filtering needed, the Hubei Daily item has three URLs: a corporate site, an e-paper, and a 404.

SELECT DISTINCT ?item ?itemLabel ?lang ?worklang ?url WHERE {
  ?item (wdt:P31/(wdt:P279*)) wd:Q11032;
        p:P856 ?statement.
  ?statement ps:P856 ?url.
  OPTIONAL {
    ?statement pq:P407 ?worklanguage.
    ?worklanguage wdt:P220 ?worklang.
  }
  OPTIONAL {
    ?item wdt:P407 ?language.
    ?language wdt:P220 ?lang.
  }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en,de,uk,ru,fr,es,it,ja,zh,ar,hu,pt,be,rus,ce,br,cs,sv,dk,da,he,fi,nb,id,eu,pl,nl,az,mar,lv,hr,am,ba,r". }
}
LIMIT 100
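To make the deduplication step a bit more concrete, here is a rough Python sketch of one way it might be done on the query output. It assumes the results were downloaded in the standard SPARQL JSON results format, and that treating `http`/`https`, a leading `www.`, and a trailing slash as equivalent is acceptable for seeds. The function names (`flatten_bindings`, `normalize_url`, `dedupe_seeds`) are made up for illustration and are not part of the crawler. It does not handle the harder filtering cases such as the e-paper or 404 URLs mentioned above.

```python
from urllib.parse import urlsplit, urlunsplit


def flatten_bindings(result):
    """Turn SPARQL JSON results into plain dicts of variable -> value."""
    return [
        {var: cell["value"] for var, cell in binding.items()}
        for binding in result["results"]["bindings"]
    ]


def normalize_url(url):
    """Normalize a URL for comparison (assumption: scheme, 'www.' and
    trailing slash do not distinguish seeds; port and fragment are dropped)."""
    parts = urlsplit(url.strip())
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    path = parts.path.rstrip("/")
    return urlunsplit(("https", host, path, parts.query, ""))


def dedupe_seeds(rows):
    """Keep one row per (item, normalized URL).

    The language attached to the URL statement (?worklang) is preferred
    over the item-level language (?lang) when both are present.
    """
    seen = {}
    for row in rows:
        key = (row["item"], normalize_url(row["url"]))
        lang = row.get("worklang") or row.get("lang")
        if key not in seen:
            seen[key] = {"item": row["item"], "url": key[1], "lang": lang}
        elif seen[key]["lang"] is None and lang:
            # A later duplicate row may carry the language the first lacked.
            seen[key]["lang"] = lang
    return list(seen.values())
```

Typical use would be `dedupe_seeds(flatten_bindings(json.load(f)))` on a saved result file. Sites with separate base URLs per language keep one row per URL, since the path is part of the key.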
The query above from a year ago now returns 156K URLs vs 11K before, but a more specific query for just news websites (Q17232649) returns a very tractable 3553 URLs for ~2500 unique entities covering 90+ languages. 571 of the sites don't include any language info, but that should be pretty easy to figure out just by fetching the home page.
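One possible way to fill in the missing language from the home page is to read the `lang` attribute on the `<html>` element. The sketch below is only an assumption about how that could be done, not something the crawler does today; `declared_language` is a hypothetical helper name. It works on HTML text that has already been fetched, and it returns whatever tag the page declares (e.g. `de-CH`), which is a BCP 47 style tag rather than the ISO 639-3 code (P220) used in the query, so a mapping step would still be needed.

```python
from html.parser import HTMLParser


class _LangParser(HTMLParser):
    """Records the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.lang = None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names for us.
        if tag == "html" and self.lang is None:
            for name, value in attrs:
                if name == "lang" and value:
                    self.lang = value.strip().lower()
                    break


def declared_language(html_text):
    """Return the language tag declared on <html>, or None if absent."""
    parser = _LangParser()
    parser.feed(html_text)
    return parser.lang
```

Pages that omit the attribute, or declare the wrong language, would need a fallback such as a content-based language detector.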
Initially, the news crawler was seeded with URLs of news sites from DMOZ; see #8 for the procedure. DMOZ isn't updated anymore, but Wikidata could be a replacement to complete the seed list: