Introduction

Dataduct is a wrapper built on top of AWS Data Pipeline that makes it easy to create ETL jobs. Each job is specified as a series of steps in a YAML file, which Dataduct automatically translates into a Data Pipeline definition with the appropriate pipeline objects.

Features include:

Visualizing pipeline activities
Extracting data from different sources such as RDS, S3, and local files
Transforming data using EC2 and EMR
Loading data into Redshift
Transforming data inside Redshift
Running QA checks on data between the source system and the warehouse

It is also easy to create custom steps that extend the DSL to fit your requirements, and to run a backfill from the command line interface.

An example ETL from RDS would look like:

name: example_upsert
frequency: daily
load_time: 01:00  # Hour:Min in UTC

steps:
-   step_type: extract-rds
    host_name: test_host
    database: test_database
    sql: |
        SELECT *
        FROM test_table;

-   step_type: create-load-redshift
    table_definition: tables/dev.test_table.sql

-   step_type: upsert
    source: tables/dev.test_table.sql
    destination: tables/dev.test_table_2.sql

This pipeline first extracts data from the RDS database with the extract-rds step, which uses a CopyActivity. It then loads the data into dev.test_table in Redshift with the create-load-redshift step. Finally, the upsert step upserts that data into dev.test_table_2.
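
For readers unfamiliar with the term, the following minimal Python sketch illustrates what an upsert means conceptually: rows whose key already exists in the destination are updated, and rows with new keys are inserted. It is not Dataduct's implementation and does not touch Redshift. The upsert function, the id key column, and the sample rows are all hypothetical, chosen only for illustration.

```python
def upsert(destination, source, key="id"):
    """Merge source rows into destination rows by key.

    Hypothetical helper for illustration only. Rows are plain dicts and
    `key` names the (assumed) primary key column.
    """
    # Index the existing destination rows by their key.
    merged = {row[key]: row for row in destination}
    # A source row replaces a destination row with the same key (update);
    # otherwise it is added as a new row (insert).
    for row in source:
        merged[row[key]] = row
    return list(merged.values())


# Made-up sample data standing in for dev.test_table_2 (destination)
# and dev.test_table (source).
destination = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]
source = [
    {"id": 2, "name": "robert"},  # existing key: updates the row
    {"id": 3, "name": "carol"},   # new key: inserts a row
]

result = upsert(destination, source)
for row in result:
    print(row)
```

In this sketch the row with id 1 is left untouched, the row with id 2 takes the values from the source, and the row with id 3 is appended. The original destination list is not modified.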