Product: DataFactory

Prepare data pipelines without code or extra tools

DataFactory is everything you need for data integration and building efficient data pipelines

Transform raw data into information and insights in ⅓ the time of other tools

DataFactory is made for self-service data ingestion, streaming, real-time replication, transformations, cleansing, preparation, wrangling, and machine learning modeling.

Build Pipelines Visually

The days of writing pages of code to move and transform data are over. Drag data operations from a tool palette onto your pipeline canvas to build even the most complex pipelines.

  • Palette of data transformations you can drag to a pipeline canvas
  • Build pipelines in minutes that would take hours to write in code
  • Automate and operationalize using built-in approval and version-control mechanisms

Data Wrangling & Transformations

It used to be that data wrangling was one tool, pipeline building was another, and machine learning was yet another tool. DataFactory brings these functions together, where they belong.

  • Perform operations easily using drag-and-drop transformations
  • Wrangle datasets to prepare for advanced analytics
  • Add & operationalize ML functions like segmentation and categorization without code

Complete Pipeline Orchestration

With DataFactory, you’ve got complete control over your pipelines — when they ingest, when they load targets — everything. It’s like your personal data pipeline robot doing your bidding.

  • Perform all ETL, ELT, and lakehouse operations visually
  • Schedule pipeline operations & receive notifications and alerts

Click into Details

DataFactory doesn’t sacrifice power for ease of use — open the nodes you’ve added to fine-tune operations exactly how you want them to perform. Use code or tune nodes using the built-in options.

  • Click on any pipeline node to see the details of the transformation
  • Adjust parameters with built-in controls or directly using SQL
  • Analyze and gain insights into your data using visualizations and dashboards

Model and Maintain an Easily Accessible Cloud Data Lake

If you’re worried about standing up a data lake just for analytics, worry no more. DataFactory includes its own data lake to save you time and money. No need to buy yet another tool just for analytic data storage.

  • Use for cold and warm data reporting & analytics
  • Save costs vs buying a separate data lake platform

Hundreds of Connectors

Batch data? Buy a tool. Streaming data? Buy another tool. That’s so 2018. DataFactory connects to any data, anywhere, without having to buy YAT (yet another tool).

  • Legacy databases
  • Modern cloud databases
  • Cloud ERPs
  • REST API sources (streaming and batch)
  • Object storage locations
  • Flat files

Whatever you need for your data engineering,
DataFactory is there for you

Capability Matrix

Data Sources → Connect → Discover → Ingest → Processing → Load

Use Cases

What can you do with DataFactory? 

Quick Insight on Datasets

Use DB Explorer to query data for rapid insights using the power of the Spark SQL engine

Save query results as datasets to be used as sources for pipeline building, data wrangling, and ML modeling
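
For a feel of what this looks like in practice, here is a minimal PySpark sketch of the same idea: query a dataset with Spark SQL and save the result for reuse downstream. The file, table, and column names are illustrative only, not DataFactory's DB Explorer API.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("quick-insight").getOrCreate()

    # Register a source file as a temporary view so it can be queried with SQL.
    orders = spark.read.option("header", True).option("inferSchema", True).csv("orders.csv")
    orders.createOrReplaceTempView("orders")

    # Run an ad-hoc aggregation for a rapid insight.
    summary = spark.sql("""
        SELECT region, COUNT(*) AS order_count, SUM(amount) AS revenue
        FROM orders
        GROUP BY region
        ORDER BY revenue DESC
    """)
    summary.show()

    # Persist the result as a dataset that downstream pipelines and wrangling steps can reuse.
    summary.write.mode("overwrite").parquet("datasets/orders_by_region")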

Easy Data Preparation

Use transformation nodes included in the tool palette to analyze the grain of the data and distribution of attributes

Use null and empty-record counts, value statistics, and length statistics to profile source data for efficient join and union operations
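
As an illustration of the kind of profiling these nodes perform, here is a small PySpark sketch computing null/empty counts, value statistics, and length statistics. The dataset and column names are hypothetical.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("profile").getOrCreate()
    customers = spark.read.option("header", True).option("inferSchema", True).csv("customers.csv")

    # Nulls and empty records per column.
    customers.select([
        F.sum(F.when(F.col(c).isNull() | (F.col(c) == ""), 1).otherwise(0)).alias(c)
        for c in customers.columns
    ]).show()

    # Value statistics (count, mean, stddev, min, max) for numeric columns.
    customers.describe().show()

    # Length statistics for a key column -- useful before join and union operations.
    customers.select(
        F.min(F.length("customer_id")).alias("min_len"),
        F.max(F.length("customer_id")).alias("max_len"),
    ).show()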

Query-based CDC

Identify and consume changed data from source databases into downstream staging and integration layers in Delta Lake

Minimize impact on source systems by identifying incremental changes with variable-enabled, timestamp-based SQL queries

Incorporate only the latest changes into your data warehouse for near real-time analytics and cost savings
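
The pattern behind query-based CDC can be sketched in a few lines. The example below uses Python with an in-memory SQLite database purely for illustration; the table, watermark column, and variable names are assumptions, and a real pipeline would run the same parameterized query against any source database.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [(1, 10.0, "2024-01-01 09:00:00"), (2, 25.0, "2024-01-02 14:30:00")],
    )

    # The watermark variable would normally be stored by the pipeline between runs.
    last_watermark = "2024-01-01 12:00:00"

    # Fetch only the rows changed since the last high-water mark.
    changed = conn.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()

    for row in changed:
        print(row)  # only the rows modified after the last run

    # Advance the watermark so the next run picks up from here.
    if changed:
        last_watermark = changed[-1][2]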

Log-based CDC

Deliver real-time data streaming by reading database logs to identify continuous changes to source data

Optimize efficiency by using background processes to scan database logs and capture changed data without impacting transactions or the source system

Easily configure and schedule using built-in log-based CDC tools
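
Downstream of the log scanner, change events are applied to the target in order. The sketch below shows that apply step on a generic insert/update/delete event shape; the event format is an assumption for illustration, not DataFactory's internal representation.

    # Stand-in for a target table, keyed by primary key.
    target = {}

    # Change events as they might arrive from a log-based CDC stream.
    events = [
        {"op": "c", "after": {"id": 1, "status": "new"}},      # insert
        {"op": "u", "after": {"id": 1, "status": "shipped"}},  # update
        {"op": "d", "before": {"id": 1}},                      # delete
    ]

    for event in events:
        if event["op"] in ("c", "u"):
            row = event["after"]
            target[row["id"]] = row
        elif event["op"] == "d":
            target.pop(event["before"]["id"], None)

    print(target)  # {} -- the row was inserted, updated, then deleted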

Anomaly Detection

Quickly perform data pre-processing or cleansing to provide the learning algorithm with a meaningful training dataset

Cap anomalies using built-in rule sets and set guardrails to achieve a higher percentage of quality data
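
A guardrail rule can be as simple as capping values to a valid range before training. The sketch below shows that idea in plain Python; the range itself is a hypothetical domain rule, not one of DataFactory's built-in rule sets.

    # Raw readings with obvious anomalies (250.0 and -3.0).
    readings = [10.2, 9.8, 10.5, 10.1, 250.0, 9.9, -3.0, 10.3]

    # Guardrails from a rule set: valid readings lie in [0, 50].
    LOWER, UPPER = 0.0, 50.0

    # Cap out-of-range values so the learning algorithm trains on a meaningful dataset.
    cleaned = [min(max(x, LOWER), UPPER) for x in readings]
    print(cleaned)  # 250.0 is capped to 50.0 and -3.0 to 0.0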

Push-down Optimization

Execute push-down optimization via push-down-enabled transformation nodes so that transformation logic runs in the source or target database

Easily configure push-down operations within transformation nodes
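
The effect of push-down is that the heavy lifting happens inside the database, so only the reduced result crosses the wire. The sketch below illustrates this with an in-memory SQLite database standing in for a source; the table and data are illustrative.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 120.0), ("east", 80.0), ("west", 200.0)],
    )

    # Without push-down, the pipeline would fetch every raw row and aggregate in memory.
    # With push-down, the GROUP BY executes inside the source database and only the
    # summarized rows come back.
    pushed_down = conn.execute(
        "SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region"
    ).fetchall()

    print(pushed_down)  # e.g. [('east', 200.0), ('west', 200.0)]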

The team was so excited that we were able to do it in a fraction of the time and so effectively.

$940K saved annually by automating data quality across nine data sources.

14 FTEs saved through automation. 60% reduction in time needed to test data.
