Generate a local TPC-H dataset of arbitrary size and file chunks and upload to AWS S3 bucket

matwerber1/tpch-dbgen-to-aws-s3

Purpose

Provide a script that makes it easy to generate a TPC-H data set and upload the output files to an AWS S3 bucket.

The generated files are chunked so that services such as Amazon Redshift, AWS Glue, and Amazon EMR can load them in parallel.

The files generated for each table are placed under their own S3 prefix, which certain services such as Redshift Spectrum require.
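The per-table layout described above can be sketched as follows. This is an illustration only, not the code in run.sh: the helper name `s3_uri_for`, the bucket name `my-example-bucket`, and the `tpch` base prefix are all hypothetical. It assumes dbgen's usual output names, for example `nation.tbl` for an unchunked table and `lineitem.tbl.3` for the third chunk of a chunked one. The loop only prints the `aws s3 cp` commands rather than running them.

```shell
# Hypothetical helper: map a dbgen output file to an S3 URI under a per-table prefix.
# Usage: s3_uri_for <bucket> <base-prefix> <file>
s3_uri_for() {
  bucket=$1
  base=$2
  file=$(basename "$3")
  # The table name is everything before the first ".tbl" (lineitem.tbl.3 -> lineitem).
  table=${file%%.tbl*}
  printf 's3://%s/%s/%s/%s\n' "$bucket" "$base" "$table" "$file"
}

# Dry run: print the upload command for each file instead of executing it.
for f in lineitem.tbl.1 lineitem.tbl.2 orders.tbl.1 nation.tbl; do
  echo aws s3 cp "$f" "$(s3_uri_for my-example-bucket tpch "$f")"
done
```

With this layout, every chunk of a table shares one prefix, so a service can be pointed at that prefix to read the whole table.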

Instructions

  1. Clone and build the TPC-H dbgen utility

     git clone https://github.com/electrum/tpch-dbgen
     cd tpch-dbgen
     make
     cd ..

  2. Clone this project

     git clone https://github.com/matwerber1/tpch-dbgen-to-aws-s3

  3. Edit the configuration variables in run.sh (e.g. S3_BUCKET=???)

  4. Run the script

     cd tpch-dbgen-to-aws-s3
     ./run.sh
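For orientation, chunked generation with dbgen amounts to roughly one dbgen invocation per chunk. The sketch below is not taken from run.sh: the variable names `SCALE_FACTOR` and `CHUNKS` and the function `dbgen_commands` are hypothetical, and the real script may be organized differently. The flags are dbgen's own: `-s` sets the scale factor, `-C` the total number of chunks, `-S` the chunk to build, and `-f` overwrites existing files. The function only prints the commands, so it can be inspected without dbgen installed.

```shell
# Hypothetical settings; the actual variable names in run.sh may differ.
SCALE_FACTOR=1   # TPC-H scale factor (roughly the dataset size in GB)
CHUNKS=4         # number of file chunks per large table

# Print one dbgen command per chunk.
# Usage: dbgen_commands <scale-factor> <chunks>
dbgen_commands() {
  scale=$1
  chunks=$2
  step=1
  while [ "$step" -le "$chunks" ]; do
    echo "./dbgen -f -s $scale -C $chunks -S $step"
    step=$((step + 1))
  done
}

dbgen_commands "$SCALE_FACTOR" "$CHUNKS"
```

dbgen reads dists.dss from its working directory by default, so commands like these would normally be run from inside the tpch-dbgen directory.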
