Creating a Data Factory

The Emcien Connectivity and Scheduling modules are optional additions to our software that stitch together the EmcienPatterns and EmcienPredict modules to create a “data factory.”

The modules rely on two key concepts to function: Data Virtualization and an Application Scheduler. Key benefits include:

Emcien Connectivity

  • Access all your data from hundreds of different types of databases, file types and RESTful API data sources

  • Join, transform, and create moving windows of related data


Emcien Scheduler

  • Automate everything from data collection and analysis to reporting and workflows

  • Create alerts that proactively notify users about fundamental changes in the data


Emcien Connectivity lets you access all of your data and easily join related data by creating standardized data sources and virtualized data views.

Standardized Data Sources

Through the use of over 100 different data connectors -- ranging from flat files to SQL DBs, NoSQL DBs to cloud services -- Emcien Connectivity makes nearly any data source or type appear as a SQL table that can be queried like any other SQL table.


With this new capability, every data source can be accessed using the exact same SQL syntax, making searches and cross-data-silo joins very easy, even when the data sources are completely different types of data.

Creating a Data Source

A data source can be created using a simple wizard that only requires a few basic inputs about the type of data source, necessary authentication information, and the name for the new data source.

Example: Creating a SQL DB Data Source

To create a connection with a SQL DB, choose ‘Add New Data Source,’ then choose the ‘MySQL’ type connector under JDBC, then input the database connection information. At this point, the MySQL database can be treated just like any other data source, even though no data is imported or saved. The data is simply accessed when Emcien products query it.

Example: Creating a CSV File Data Source

To create a SQL connection with a CSV flat file, choose ‘Add New Data Source,’ then choose the ‘File’ type connector, then select the file to connect to. At this point, the file can be treated like a SQL table or view, even though its data is not imported. The data is simply accessed when Emcien products query it.
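The "access at query time, never import" idea can be illustrated with a minimal Python sketch. This is not how Emcien Connectivity is implemented; the `CsvSource` class, the file name, and the column names are all hypothetical, chosen only to show that a source which stores just a path always reflects the current file contents.

```python
import csv
import os
import tempfile


class CsvSource:
    """Hypothetical stand-in for a file data source: it stores only the path."""

    def __init__(self, path):
        self.path = path  # no rows are read or cached here

    def query(self, predicate=lambda row: True):
        # The file is opened and streamed only when a query is made.
        with open(self.path, newline="") as handle:
            for row in csv.DictReader(handle):
                if predicate(row):
                    yield row


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "orders.csv")
        with open(path, "w", newline="") as handle:
            handle.write("order_id,region\n1,east\n2,west\n")
        source = CsvSource(path)
        first = list(source.query(lambda r: r["region"] == "west"))
        # Append a row to the file; the next query sees it without a re-import.
        with open(path, "a", newline="") as handle:
            handle.write("3,west\n")
        second = list(source.query(lambda r: r["region"] == "west"))
        return len(first), len(second)
```

Because nothing is cached, the second query picks up the appended row, which is the behavior the text describes for a connected file.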

Virtualized Data Views

With virtualized views, nothing happens until the moment a query is made with EmcienScan or EmcienPatterns. When a query occurs, the data is transformed, joined and streamed as the records come in from the source database -- no need to touch the source data or have a large data warehouse to keep copies of the data.

Always up-to-date, these virtual views allow the user to:

  • Create virtual tables that represent data across data silos (without copying the data)

  • Create moving windows of data

  • Create modified versions of data within a table, without altering the source

  • Provide on-the-fly “risk-free” cleansing of data for analysis (data source not altered)

  • Analyze data from a JSON API as if it were a table of data

Within EmcienScan, these views look like physical tables or views. In reality, they are created on-the-fly when analyzed with EmcienScan or EmcienPatterns, which avoids the storage and data-copying overhead of a traditional data warehouse.
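Two of the capabilities listed above, a moving window and cleansing without altering the source, can be sketched with an ordinary SQL view. The example below uses Python's built-in sqlite3 module purely as a stand-in SQL engine; the `events` table, its columns, and the 30-day window are assumptions made for illustration, and the SQL dialect Emcien Connectivity accepts may differ.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def build_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (customer TEXT, amount REAL, event_date TEXT)")
    today = datetime.now(timezone.utc).date()
    rows = [
        ("  acme ", 10.0, (today - timedelta(days=5)).isoformat()),   # inside window
        ("Globex", 20.0, (today - timedelta(days=60)).isoformat()),   # outside window
    ]
    conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)
    # The view is evaluated at query time, so the 30-day window moves with "now"
    # and the cleansed customer name never touches the stored rows.
    conn.execute("""
        CREATE VIEW recent_clean_events AS
        SELECT UPPER(TRIM(customer)) AS customer, amount, event_date
        FROM events
        WHERE event_date >= date('now', '-30 days')
    """)
    return conn
```

Querying `recent_clean_events` returns only the recent row with a tidied customer name, while the `events` table still holds the original, untrimmed value.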

Creating a Virtual View

A virtual view can be created using standard SQL across one or more data sources, just as if they were located within a single physical data warehouse.

Joining 3 different data sources into a single view is no harder than joining 3 tables within a single database. Once the “CREATE VIEW” SQL statement is run, the virtual view is ready to be used within Emcien products.

An example of having EmcienScan discover connections across 3 different data sources would look something like the following diagram if we wanted to create a virtual view called “cross_silo_customer_view”.

As shown in the diagram, EmcienScan would show a single view called “cross_silo_customer_view.” It could analyze that view for correlated columns and outliers and provide a full data profile, all for a virtual view that isn’t persisted on any server.
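A “CREATE VIEW” statement of the kind described here might look like the following sketch. It again uses Python's sqlite3 module as a stand-in: three attached in-memory databases play the role of three separate data sources, and the schema names (`crm`, `billing`, `support`), tables, and columns are hypothetical. Only the view name comes from the text above.

```python
import sqlite3


def build_cross_silo_view():
    conn = sqlite3.connect(":memory:")
    # Three attached in-memory databases stand in for three separate sources.
    for name in ("crm", "billing", "support"):
        conn.execute(f"ATTACH DATABASE ':memory:' AS {name}")
    conn.execute("CREATE TABLE crm.customers (customer_id INTEGER, name TEXT)")
    conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, total REAL)")
    conn.execute("CREATE TABLE support.tickets (customer_id INTEGER, subject TEXT)")
    conn.execute("INSERT INTO crm.customers VALUES (1, 'Acme')")
    conn.execute("INSERT INTO billing.invoices VALUES (1, 250.0)")
    conn.execute("INSERT INTO support.tickets VALUES (1, 'Login issue')")
    conn.commit()
    # SQLite only lets TEMP views span attached databases; other engines differ.
    conn.execute("""
        CREATE TEMP VIEW cross_silo_customer_view AS
        SELECT c.customer_id, c.name, i.total, t.subject
        FROM crm.customers AS c
        JOIN billing.invoices AS i ON i.customer_id = c.customer_id
        JOIN support.tickets AS t ON t.customer_id = c.customer_id
    """)
    return conn
```

The join reads exactly like a three-table join inside one database, which is the point the text makes about cross-silo views.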

Emcien Scheduler enables you to automate everything from data collection and analysis to reporting and workflows, and to create alerts that proactively notify users about fundamental changes in the data.

Emcien modules -- EmcienScan, EmcienPatterns, EmcienPredict -- all have APIs. Emcien Scheduler can access these modules through their APIs to orchestrate them, making them run continuously or according to a predetermined schedule.

As a result, Emcien Scheduler can update virtual views of data, run data scans, and push computed metadata into a reporting data warehouse for a continuous view of what’s happening within your data streams every day.

Scheduler Tasks

Through Emcien Scheduler, nearly any repeating process of data movement, analysis or prediction can be created. These tasks can be any type of SQL, calls to REST API endpoints, even command line actions -- whatever is needed for a recurring data process in any environment.

Scheduling a Task

An example would be re-scanning a database table every 30 minutes. EmcienScan has a “rebuild” API. By simply having the Scheduler call EmcienScan’s “rebuild” API endpoint, the scan process is now repeatable. The desire to re-scan is so common that it is available as one of the many out-of-the-box tasks that come with Emcien Scheduler. Simply changing the ID of the scan to rebuild and putting a valid API authentication token into the existing text of the call will create a task.
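To make the shape of such a task concrete, here is a hedged Python sketch that builds (but does not send) a request to a rebuild endpoint. The URL layout, the `Bearer` authorization header, the host name, and the scan ID are all assumptions for illustration; the actual EmcienScan endpoint and authentication format should be taken from the out-of-the-box task or the product's API documentation.

```python
import urllib.request


def build_rebuild_request(base_url, scan_id, api_token):
    # Hypothetical URL layout and auth header; not the documented EmcienScan API.
    url = f"{base_url}/scans/{scan_id}/rebuild"
    return urllib.request.Request(
        url,
        method="POST",
        headers={"Authorization": f"Bearer {api_token}"},
    )


# Placeholder host, scan ID, and token: the two values a task edit would change
# are the scan ID and the API token.
request = build_rebuild_request("https://emcien.example.com/api", 42, "YOUR_API_TOKEN")
# A scheduled task would then send it with urllib.request.urlopen(request).
```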

Even easier, to make this task run every 30 minutes, simply choose the new task and choose ‘Schedule’. A wizard provides a myriad of ways to decide when this task will be executed. In our case we simply type in 30 minutes and click ‘Finish’. Now we can re-scan the database table twice per hour.
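The effect of a 30-minute interval can be checked with a few lines of Python. This is only an arithmetic sketch of a fixed-interval schedule, not the Scheduler's own logic; the `next_runs` helper and the start time are hypothetical.

```python
from datetime import datetime, timedelta


def next_runs(start, interval_minutes, count):
    """Return the next `count` run times after `start` at a fixed interval."""
    step = timedelta(minutes=interval_minutes)
    return [start + step * i for i in range(1, count + 1)]


# Two runs fall within the hour after a 09:00 start, matching "twice per hour".
runs = next_runs(datetime(2024, 1, 1, 9, 0), 30, 2)
```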

As mentioned, Emcien Scheduler comes with a set of the most common repeating tasks that customers need. Simply edit an existing task to include a customer’s API token, and a repeating process will be running in minutes.

For more information, contact