Creating an ETL

Dataduct makes it easy to write ETL pipelines for Data Pipeline. The details and logic live in YAML files, which Dataduct automatically translates into a Data Pipeline definition with the appropriate pipeline objects and other configuration.

Writing a Dataduct YAML File

For general YAML syntax, see the YAML syntax documentation. A Dataduct YAML file has three parts:

  • Header information
  • Description
  • Pipeline steps

Example:

# HEADER INFORMATION
name : example_emr_streaming
frequency : one-time
load_time: 01:00  # Hour:Min in UTC
topic_arn: 'arn:aws:sns:example_arn'
emr_cluster_config:
    num_instances: 1
    instance_size: m1.xlarge
    bootstrap:
        string: "s3://elasticmapreduce/bootstrap-actions/configure-hadoop,--yarn-key-value, yarn.scheduler.maximum-allocation-mb=9500"

# DESCRIPTION
description : Example for the emr_streaming step

# PIPELINE STEPS
steps:
-   step_type: extract-local
    path: data/word_data.txt

-   step_type: emr-streaming
    mapper: scripts/word_mapper.py
    reducer: scripts/word_reducer.py

-   step_type: transform
    script: scripts/s3_profiler.py
    script_arguments:
    -   --input=INPUT1_STAGING_DIR
    -   --output=OUTPUT1_STAGING_DIR
    -   -f

Header Information

The header includes configuration information for Data Pipeline and the Elastic MapReduce resource.

The name field sets the overall pipeline name:

name : example_emr_streaming

The frequency field sets how often the pipeline runs. The currently supported intervals are hourly, daily, and one-time:

frequency : one-time
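
The three supported values can be checked before a pipeline definition is submitted. The following is a minimal Python sketch of such a check, not Dataduct's own validation; the names SUPPORTED_FREQUENCIES and check_frequency are hypothetical and chosen only for illustration.

```python
# Values listed in this document as supported frequencies.
SUPPORTED_FREQUENCIES = ("hourly", "daily", "one-time")


def check_frequency(value):
    """Return value unchanged if it is a supported frequency, else raise.

    Hypothetical helper for illustration; not part of Dataduct.
    """
    if value not in SUPPORTED_FREQUENCIES:
        raise ValueError(
            "unsupported frequency %r; expected one of %s"
            % (value, ", ".join(SUPPORTED_FREQUENCIES))
        )
    return value


if __name__ == "__main__":
    print(check_frequency("one-time"))
```

A value such as weekly would raise ValueError in this sketch, since it is not among the intervals listed above.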

The load_time field sets the time of day, in UTC, at which the pipeline is scheduled to run. It uses the format HH:MM, so 01:00 schedules the pipeline to run at 1 AM UTC:

load_time: 01:00  # Hour:Min in UTC
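
To make the HH:MM format concrete, the sketch below parses a load time string with Python's standard library. It only illustrates the format; the function name parse_load_time is hypothetical, and how Dataduct itself parses this field is not shown here.

```python
from datetime import datetime, time


def parse_load_time(value):
    """Parse an HH:MM string (taken to be UTC) into a datetime.time.

    Hypothetical helper for illustration; not part of Dataduct.
    """
    # strptime raises ValueError for input that does not match HH:MM,
    # for example "25:00" or "1am".
    return datetime.strptime(value, "%H:%M").time()


if __name__ == "__main__":
    scheduled = parse_load_time("01:00")
    print(scheduled.hour, scheduled.minute)
```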

Your config file can specify a default Amazon Resource Name (ARN) that is messaged if the pipeline fails. To override this default ARN for a particular pipeline, use the topic_arn property:

topic_arn: 'arn:aws:sns:example_arn'

If the pipeline includes an emr-streaming step, you can configure the EMR cluster. For example, you can set the bootstrap action, the number of core instances, and the instance type:

emr_cluster_config:
    num_instances: 1
    instance_size: m1.xlarge
    bootstrap:
        string: "s3://elasticmapreduce/bootstrap-actions/configure-hadoop,--yarn-key-value, yarn.scheduler.maximum-allocation-mb=9500"

Note: Arguments in the bootstrap step are delimited by commas, not spaces.
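
The effect of comma delimiting can be seen by splitting the example string by hand. This is a minimal Python sketch under the assumption that the string is split on commas only; the function name split_bootstrap is hypothetical, and whether Dataduct trims whitespace around each item is not stated here, so the sketch leaves it untouched.

```python
from typing import List


def split_bootstrap(bootstrap: str) -> List[str]:
    """Split a bootstrap string into the action path and its arguments.

    Hypothetical helper for illustration; not part of Dataduct.
    """
    # Only commas act as delimiters. Spaces are not delimiters, so any
    # space stays inside the item it appears in.
    return bootstrap.split(",")


if __name__ == "__main__":
    example = (
        "s3://elasticmapreduce/bootstrap-actions/configure-hadoop,"
        "--yarn-key-value, yarn.scheduler.maximum-allocation-mb=9500"
    )
    for part in split_bootstrap(example):
        # repr() makes any leading or trailing space visible.
        print(repr(part))
```

With the example string this yields three items: the bootstrap action path followed by two arguments. The space after the second comma is kept as part of the third item, which is why spaces should not be used to separate arguments.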

Description

The description allows the creator of the YAML file to clearly explain the purpose of the pipeline.