This repository has been archived by the owner on Apr 1, 2024. It is now read-only.

Scripts for importing data into Produpedia.org from latest DBpedia dataset

produpedia-org/dbpedia-import


This repo contains some of the helper scripts used to transform the DBpedia dataset into Produpedia's database (also downloadable from https://produpedia.org/static/download.html).

Generally, this process works, but it is scattered across multiple files and even repositories. It could probably be unified into a single script. To reproduce it, follow the steps below.

The *.coffee files mentioned below are scripts meant to be run with Deno. Deno does not yet support transpiling CoffeeScript, so you need to run npm run coffee -c -b *.coffee beforehand and then run the resulting .js files. (I know)

product and subject mean the same thing.

  1. Spin up a local DBpedia instance
  2. Get categories.json, the json representation of the browsable category tree
  3. Import categories.json into the main repo DB with categories.ts. This requires the MongoDB database to be set up (documented in the main repo).
  4. Generate category aliases with categories-aliases. The output should probably be integrated into categories.json above instead.
  5. Generate categories_dump.json with e.g. mongoexport -d database_name -c category --jsonArray --pretty > categories_dump.json. This is necessary because the up-to-date category state now lives in the DB.
  6. Use it to run get_products.coffee (in this repo again). It produces products.txt. Note that this script consumes a lot of RAM because all >4 million products with their categories stay in memory. You will need something along the lines of deno run --unstable --v8-flags=--max-old-space-size=8192 get_products.js. It can easily take half an hour or so.
  7. Run attributes.ts. It gets (more or less) all DBpedia attributes and saves them to the DB.
  8. Run products.ts. It reads products.txt, gets all relevant values in small batches and saves them to the DB. This can take many hours.
  9. Create the DB indexes listed in the comments at the top of Product.ts, Attribute.ts and Category.ts.
  10. Generate the Category.showers and Category.products_count properties with categories-showers. This script also uses a lot of RAM.
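
Step 8 mentions processing products.txt "in small batches". The following is a minimal sketch of that idea only, not the code in products.ts: the helper name inBatches, the batch size and the sample product names are all made up for illustration, and the actual DBpedia query and MongoDB write are left out.

```typescript
// Hypothetical helper, not taken from products.ts: groups items into
// fixed-size batches so only one batch has to be handled at a time.
function* inBatches<T>(items: Iterable<T>, size: number): Generator<T[]> {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError("size must be a positive integer");
  }
  let batch: T[] = [];
  for (const item of items) {
    batch.push(item);
    if (batch.length === size) {
      yield batch;
      batch = [];
    }
  }
  // Emit the final, possibly shorter batch.
  if (batch.length > 0) yield batch;
}

// Made-up sample standing in for the lines of products.txt.
const sampleLines = ["Product_A", "Product_B", "Product_C", "Product_D", "Product_E"];

for (const batch of inBatches(sampleLines, 2)) {
  // A real import would fetch the values for these subjects and save them here.
  console.log(batch.length, batch.join(","));
}
```

Because the helper is a generator, a failure in one batch only affects that batch, which makes a run lasting many hours easier to resume than a single all-or-nothing pass.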

If you actually do all of that, you might run into some minor errors because I haven't run through the whole process again myself. Just open an issue and we will resolve it quickly.
