GitHub - produpedia-org/dbpedia-import: Scripts for importing data into Produpedia.org from latest DBpedia dataset

This repo contains some of the helper scripts that transform the DBpedia dataset into Produpedia's database (also downloadable from https://produpedia.org/static/download.html).

Generally, this process works but is scattered across multiple files and even repositories. It could probably all be unified into one script. To reproduce it, follow these steps:

*.coffee files below refer to scripts meant to be run with deno. Deno does not yet support transpiling CoffeeScript, so you need to run npm run coffee -c -b *.coffee beforehand and then run the resulting .js file. (I know)

product and subject mean the same thing.

Spin up a local DBpedia instance
- See https://www.dbpedia.org/resources/latest-core/ and https://github.com/dbpedia/virtuoso-sparql-endpoint-quickstart
- Because the scripts implement no paging and the operations can exceed computation limits, you should increase the thresholds inside virtuoso.ini in your instance, for example to the ridiculously high values from virtuoso.ini.modifications. Please also read the documentation about NumberOfBuffers and MaxDirtyBuffers in your generated virtuoso.ini file.
- The collection in use is https://databus.dbpedia.org/phil294/collections/dbpedia-latest-core-produpedia (mirrored at databus-collection-mirror.json)
- You will also need:
  - The properties givenName, surname and gender from the 2016 (!) persondata dataset, because they are missing in the latest ones. You could probably just add the whole dataset to the collection (or put it into the downloads folder manually). When I did it, I grepped for those three properties only, for some reason.
  - https://databus.dbpedia.org/dbpedia/wikidata/images because they are missing (see "Current issues") and https://databus.dbpedia.org/dbpedia/wikidata/sameas-all-wikis/ to connect both datasets (used in get_products.coffee). The SameAs all wikis dataset is rather big; you can also extract it and keep only the lines referencing the English wiki with grep '<http://dbpedia.org/resource/' > outfile
  - With the above steps, the resulting virtuoso-db folder will be about 14 GB in size.
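Since the scripts do no paging, the alternative to raising the Virtuoso thresholds would be to page the queries. The following is only a sketch of what that could look like with SPARQL LIMIT/OFFSET, not something the scripts in this repo do. The names pagedQuery, fetchAllRows, Row and RunQuery are hypothetical, and runQuery stands in for whatever function sends a query to your local endpoint.

```typescript
// Sketch only: none of these names exist in the repo's scripts.
type Row = Record<string, string>;
type RunQuery = (sparql: string) => Promise<Row[]>;

// Appends LIMIT/OFFSET for a zero-based page number. The query should
// contain an ORDER BY, otherwise consecutive pages may overlap or skip rows.
function pagedQuery(select: string, pageSize: number, page: number): string {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  if (!Number.isInteger(page) || page < 0) {
    throw new RangeError("page must be a non-negative integer");
  }
  return `${select.trim()}\nLIMIT ${pageSize}\nOFFSET ${page * pageSize}`;
}

// Requests page after page until one comes back shorter than pageSize.
// runQuery is supplied by the caller (assumption: it returns one object per result row).
async function fetchAllRows(
  select: string,
  pageSize: number,
  runQuery: RunQuery,
): Promise<Row[]> {
  const all: Row[] = [];
  for (let page = 0; ; page++) {
    const rows = await runQuery(pagedQuery(select, pageSize, page));
    for (const row of rows) all.push(row);
    if (rows.length < pageSize) return all;
  }
}
```

Each single request then stays below the endpoint's limits, at the cost of more round trips and a sort on the server side.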
Get categories.json, the json representation of the browsable category tree
- Either take the one from the data repo, which is the maintained one,
- or create and maintain one yourself, e.g. using
  - get_ontology_classes_tree.coffee. This gets all ontology classes listed at http://mappings.dbpedia.org/server/ontology/classes (archive.org mirror)
  - Manual edits (compare: Edits to categories.json over time)
  - additional-categories-manually: uses gold:hypernym to print out further interesting categories. Only a helper script, does not extend categories.json for you. Every time before you use it, you should also have run categories.ts because it populates the DB with the categories from categories.json. Both scripts are in the main repo.
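To illustrate what a browsable category tree amounts to, here is a minimal sketch that turns (class, superclass) pairs into a nested structure. It is not the code of get_ontology_classes_tree.coffee, and the CategoryNode shape with its name and children fields is an assumption; the real categories.json may look different.

```typescript
// Hypothetical node shape; the real categories.json may use other field names.
interface CategoryNode {
  name: string;
  children: CategoryNode[];
}

// Builds a forest from [class, superclass] pairs. A null superclass marks a
// class without a parent. Every class that never gets a parent is returned
// as a root, in order of first appearance.
function buildTree(pairs: Array<[string, string | null]>): CategoryNode[] {
  const nodes = new Map<string, CategoryNode>();
  const hasParent = new Set<string>();

  // Returns the node for a name, creating it on first use.
  const nodeFor = (name: string): CategoryNode => {
    let node = nodes.get(name);
    if (!node) {
      node = { name, children: [] };
      nodes.set(name, node);
    }
    return node;
  };

  for (const [name, parent] of pairs) {
    const node = nodeFor(name);
    if (parent !== null) {
      nodeFor(parent).children.push(node);
      hasParent.add(name);
    }
  }
  return Array.from(nodes.values()).filter((n) => !hasParent.has(n.name));
}

// Example input using a few DBpedia ontology class names.
const exampleRoots = buildTree([
  ["Thing", null],
  ["Agent", "Thing"],
  ["Person", "Agent"],
  ["Place", "Thing"],
]);
console.log(JSON.stringify(exampleRoots, null, 2));
```

Manual edits, as mentioned above, would then be applied to the resulting JSON rather than to the input pairs.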
Import categories.json into the main repo DB with categories.ts. It requires you to have the mongodb database set up (documented over there).
Generate category aliases with categories-aliases. The output should probably be integrated into categories.json above instead.
Generate categories_dump.json with e.g. mongoexport -d database_name -c category --jsonArray --pretty > categories_dump.json. This is necessary because the up-to-date category state now lives in the DB.
Use it to run get_products.coffee (in this repo again). It produces products.txt. Please note that this script consumes a lot of RAM because all >4 million products with their categories stay in memory. You will need something along the lines of deno run --unstable --v8-flags=--max-old-space-size=8192 get_products.js. Can easily take half an hour or so.
Run attributes.ts. It gets (more or less) all DBpedia attributes and saves them to the db.
Run products.ts. It reads products.txt, gets all relevant values in small batches and saves them to the db. Can take many hours.
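Fetching values in small batches can be pictured roughly as follows. This is a sketch, not the code of products.ts: it assumes each product is identified by a resource IRI and that a batch is passed to the endpoint through a SPARQL VALUES clause. The helpers inBatches and valuesClause are hypothetical names.

```typescript
// Splits a list into consecutive batches of at most batchSize items.
function inBatches<T>(items: T[], batchSize: number): T[][] {
  if (!Number.isInteger(batchSize) || batchSize <= 0) {
    throw new RangeError("batchSize must be a positive integer");
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Renders a SPARQL VALUES clause binding ?variable to each IRI of one batch.
// Assumption: the IRIs are absolute and contain no angle brackets or whitespace.
function valuesClause(variable: string, iris: string[]): string {
  for (const iri of iris) {
    if (/[<>\s]/.test(iri)) {
      throw new Error(`IRI cannot be embedded safely: ${iri}`);
    }
  }
  return `VALUES ?${variable} { ${iris.map((iri) => `<${iri}>`).join(" ")} }`;
}

// Example: three made-up resource IRIs, queried two at a time.
const exampleIris = [
  "http://dbpedia.org/resource/A",
  "http://dbpedia.org/resource/B",
  "http://dbpedia.org/resource/C",
];
for (const batch of inBatches(exampleIris, 2)) {
  console.log(valuesClause("product", batch));
}
```

Smaller batches keep each query cheap for the endpoint, which is presumably part of why the full run takes many hours.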
Create the DB indexes listed in the comments at the top of Product.ts, Attribute.ts and Category.ts.
Generate Category.showers and Category.products_count properties with categories-showers. This script also eats up a lot of RAM.

If you actually do all of that, you might run into some minor errors because I haven't done all of it again myself. Just open an issue and we will resolve it quickly.

