Dataduct is a wrapper built on top of AWS Data Pipeline that makes it easy to create ETL jobs. Jobs are specified as a series of steps in a YAML file and are automatically translated into a pipeline with the appropriate Data Pipeline objects. Among other things, Dataduct supports:
- Visualizing pipeline activities
- Extracting data from different sources such as RDS, S3, and local files
- Transforming data using EC2 and EMR
- Loading data into Redshift
- Transforming data inside Redshift
- Running QA checks on data between the source system and the warehouse
It is easy to create custom steps to augment the DSL as your requirements evolve, and to run a backfill with the command-line interface.
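To make the custom-step idea concrete, here is a minimal sketch of how a step type in the YAML DSL could map to a Python class. The base class, registry, and `count-rows` step below are all hypothetical illustrations, not Dataduct's actual API:

```python
# Illustrative sketch only: Dataduct's real base class and registration
# mechanism may differ. ETLStep, STEP_CLASSES, and the count-rows step
# are hypothetical names invented for this example.

class ETLStep:
    """Minimal stand-in for a pipeline step base class."""
    def __init__(self, **kwargs):
        self.params = kwargs

# Registry mapping the step_type string used in YAML to a step class
STEP_CLASSES = {}

def register_step(step_type):
    def decorator(cls):
        STEP_CLASSES[step_type] = cls
        return cls
    return decorator

@register_step('count-rows')
class CountRowsStep(ETLStep):
    """Hypothetical custom step that counts rows in a table."""
    def __init__(self, table, **kwargs):
        super().__init__(**kwargs)
        self.sql = 'SELECT COUNT(1) FROM %s;' % table

# A YAML entry like `step_type: count-rows` would be dispatched here
step = STEP_CLASSES['count-rows'](table='dev.test_table')
print(step.sql)  # SELECT COUNT(1) FROM dev.test_table;
```

The registry pattern is what lets new step types plug into the DSL without touching the YAML parser itself.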
An example ETL from RDS would look like:
```yaml
name: example_upsert
frequency: daily
load_time: 01:00  # Hour:Min in UTC

steps:
-   step_type: extract-rds
    host_name: test_host
    database: test_database
    sql: |
        SELECT *
        FROM test_table;

-   step_type: create-load-redshift
    table_definition: tables/dev.test_table.sql

-   step_type: upsert
    source: tables/dev.test_table.sql
    destination: tables/dev.test_table_2.sql
```
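Conceptually, each entry under `steps` becomes an activity that depends on the output of the step before it. The sketch below illustrates that chaining with invented field names (`id`, `depends_on`); it is not Dataduct's actual translation code:

```python
# Rough illustration of how an ordered list of YAML steps could be
# chained into dependent activities. The dict below mirrors the parsed
# YAML from the example above; the activity fields are invented.

definition = {
    'name': 'example_upsert',
    'frequency': 'daily',
    'load_time': '01:00',
    'steps': [
        {'step_type': 'extract-rds', 'host_name': 'test_host',
         'database': 'test_database', 'sql': 'SELECT * FROM test_table;'},
        {'step_type': 'create-load-redshift',
         'table_definition': 'tables/dev.test_table.sql'},
        {'step_type': 'upsert', 'source': 'tables/dev.test_table.sql',
         'destination': 'tables/dev.test_table_2.sql'},
    ],
}

def build_pipeline(definition):
    """Chain each step to the step that precedes it."""
    activities = []
    for index, step in enumerate(definition['steps']):
        activities.append({
            'id': '%s_step%d' % (definition['name'], index),
            'type': step['step_type'],
            'depends_on': activities[-1]['id'] if activities else None,
        })
    return activities

for activity in build_pipeline(definition):
    print(activity['id'], activity['type'], activity['depends_on'])
```

The key point is that the YAML order fully determines the dependency chain, so no explicit wiring between steps is needed in the definition.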
This would first extract data from the RDS database with the `extract-rds` step, which uses a Data Pipeline `CopyActivity`. The `create-load-redshift` step then loads the data into `dev.test_table` in Redshift. Finally, the `upsert` step merges the data into `dev.test_table_2`.
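Since Redshift has no native `UPSERT` statement, an upsert is commonly implemented as a delete-then-insert inside a transaction. The sketch below shows that pattern; the generated SQL and the `id` key column are assumptions for illustration, not Dataduct's exact output:

```python
# Sketch of the delete-then-insert upsert pattern commonly used with
# Redshift. The exact SQL Dataduct generates may differ; the key column
# name ('id') is an assumption for this example.

def upsert_sql(source, destination, key):
    return '\n'.join([
        'BEGIN;',
        # Remove destination rows that the source will replace
        'DELETE FROM %s USING %s WHERE %s.%s = %s.%s;' % (
            destination, source, destination, key, source, key),
        # Insert the fresh rows from the source table
        'INSERT INTO %s SELECT * FROM %s;' % (destination, source),
        'COMMIT;',
    ])

print(upsert_sql('dev.test_table', 'dev.test_table_2', 'id'))
```

Wrapping both statements in one transaction keeps readers from seeing the destination table in a half-updated state.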