If you want to clone this repository to start your own project, you can choose the license you prefer and feel free to delete anything related to the license you are dropping.
- Docker Engine Installation Guide
- Java OpenJDK 11+ Installation Guide
./gradlew clean build
This project aims to support multiple modules/pipelines. To achieve that, every pipeline should follow the structure of the example-spark-runner module, which has a few files:
- The main source, `/src/main/kotlin`:
  - `App.kt` defines the `main()` entrypoint (a sketch follows this list)
  - `/schema` contains all schema definitions
  - `/pipeline` contains the pipeline definition
  - `/ptranform` contains all the pipeline transformations (PTransforms)
- The test source, `src/test/kotlin`, defines the unit tests
- The test source, `src/integrationTest/kotlin`, defines the integration tests
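For reference, here is a minimal, hedged sketch of what an `App.kt` entrypoint can look like. The actual file in example-spark-runner may differ; the options handling shown is just the common Beam pattern of parsing runner flags from the command line.

```kotlin
// Sketch of an App.kt entrypoint (illustrative; the real file may differ).
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main(args: Array<String>) {
    // Parse runner and pipeline options from the command line (e.g. --runner=SparkRunner).
    val options = PipelineOptionsFactory.fromArgs(*args).withValidation().create()

    // Build the pipeline; transforms would normally live under /pipeline and /ptranform.
    val pipeline = Pipeline.create(options)

    // ... apply the PTransforms that make up this pipeline ...

    // Run and block until the pipeline finishes.
    pipeline.run().waitUntilFinish()
}
```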
✅ : Done!
⚠️ : TODO
- Apache Spark Runner ✅
- Read/Write .CSV files ✅
- AWS S3 integration ✅
- Kafka integration ⚠️
- Oracle JDBC Integration ✅
To keep this template small, it only includes the Direct Runner for runtime tests and the Spark Runner as an integration test example. For a comparison of what each runner currently supports, see the Beam Capability Matrix. To add a new runner, visit that runner's page for instructions on how to include it.
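As an illustration only, adding a runner usually comes down to an extra dependency in `build.gradle.kts`. This sketch assumes the Gradle Kotlin DSL; the artifact coordinates are Beam's published runner modules, and the version shown is illustrative rather than the one pinned by this template.

```kotlin
// build.gradle.kts (sketch): runner dependencies; the version is illustrative.
dependencies {
    // Direct Runner, used here for local runtime tests.
    testImplementation("org.apache.beam:beam-runners-direct-java:2.41.0")
    // Spark Runner, used by the integration test example.
    implementation("org.apache.beam:beam-runners-spark:2.41.0")
}
```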
Pipeline: A Pipeline encapsulates your entire data processing task, from start to finish. This includes reading input data, transforming that data, and writing output data. All Beam driver programs must create a Pipeline. When you create the Pipeline, you must also specify the execution options that tell the Pipeline where and how to run.
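As a hedged Kotlin sketch, creating a Pipeline and picking its runner through the execution options looks like this (it complements the `App.kt` sketch above by setting the runner in code instead of from command-line flags; the choice of DirectRunner is just an example):

```kotlin
import org.apache.beam.runners.direct.DirectRunner
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory

fun main() {
    // Execution options tell Beam where (which runner) and how to run the pipeline.
    val options = PipelineOptionsFactory.create()
    options.runner = DirectRunner::class.java  // a Spark runner would be configured the same way

    // The Pipeline owns the whole graph: reads, transforms, and writes are applied to it.
    val pipeline = Pipeline.create(options)
    pipeline.run().waitUntilFinish()
}
```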
PCollection: A PCollection represents a distributed data set that your Beam pipeline operates on. The data set can be bounded, meaning it comes from a fixed source like a file, or unbounded, meaning it comes from a continuously updating source via a subscription or other mechanism. Your pipeline typically creates an initial PCollection by reading data from an external data source, but you can also create a PCollection from in-memory data within your driver program. From there, PCollections are the inputs and outputs for each step in your pipeline.
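A small Kotlin sketch of creating a bounded PCollection from in-memory data (the element values are made up; reading from a file or another external source would produce a PCollection in the same way):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.values.PCollection

fun main() {
    val pipeline = Pipeline.create(PipelineOptionsFactory.create())

    // A bounded PCollection built from in-memory data; it becomes the input
    // to the next step in the pipeline.
    val names: PCollection<String> = pipeline.apply(Create.of("alice", "bob", "carol"))

    pipeline.run().waitUntilFinish()
}
```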
PTransform: A PTransform represents a data processing operation, or a step, in your pipeline. Every PTransform takes one or more PCollection objects as input, performs a processing function that you provide on the elements of that PCollection, and produces zero or more output PCollection objects.
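A hedged Kotlin sketch of a custom per-element step, wrapping a `DoFn` in `ParDo` (the class name `UppercaseFn` and the input values are illustrative; in this template such classes would live under `/ptranform`):

```kotlin
import org.apache.beam.sdk.Pipeline
import org.apache.beam.sdk.options.PipelineOptionsFactory
import org.apache.beam.sdk.transforms.Create
import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.ParDo

// Illustrative processing function: uppercase every element of the input PCollection.
class UppercaseFn : DoFn<String, String>() {
    @ProcessElement
    fun processElement(@Element word: String, out: OutputReceiver<String>) {
        out.output(word.uppercase())
    }
}

fun main() {
    val pipeline = Pipeline.create(PipelineOptionsFactory.create())
    pipeline
        .apply(Create.of("read", "transform", "write")) // input PCollection<String>
        .apply(ParDo.of(UppercaseFn()))                 // PTransform step producing a new PCollection
    pipeline.run().waitUntilFinish()
}
```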
Thank you for your interest in contributing! All contributions are welcome! 🎉🎊
This software is distributed under the terms of both the MIT license and the Apache License (Version 2.0).
See LICENSE for details.