Automating Data Analysis and Predictions with a “Data Factory”
What is a “Data Factory”?
Today’s industrial factories take in raw materials and parts and produce finished goods. They do this with 3 driving principles:
1. Repeatability: Making the same product more than once depends on having turned your product into a series of steps, ingredients and processes that can be repeated over and over.
2. Automation: Once a factory is set up and working, the next stage is to automate redundant tasks and processes in order to reduce workforce needs while increasing quality and reliability. Some see full automation as the ideal goal, removing manual steps from the complete process.
3. Scale: If you follow principles 1 & 2 above, you will have the opportunity to increase the amount your factory produces.
So what does this have to do with a “Data Factory”?
A “Data Factory” has similar driving principles, but the raw material is data and the products are predictions and insights. With this in mind, the Data Factory needs to be set up in a similar way to an industrial factory, with the goals of repeatability, automation and scale.
EmcienPatterns provides both the Engine and the Automation APIs to make the “Data Factory” a reality. It allows for complete automation, with no human intervention, at scale. You can also have as many factories as you want, each targeting different use-cases, datasets and workflows.
Setting up a “Data Factory”
This is a quick overview of how to set up a recurring data analysis and prediction “factory” that can autonomously learn from new incoming data and make predictions in batch or real-time. These predictions can control systems or update dashboards. This process is also referred to as “Operationalizing” your data.
Let’s take a look at the 3 steps needed to create a “Data Factory” and “Operationalize” the results:
Step 1. Setup / Configure
This will differ based on whether the system is run on-premise or in the Emcien Cloud. For this document we are assuming the customer is running an on-premise VM.
This step consists of:
- Setting up the EmcienPatterns VM on VMware or VirtualBox
- Configuring the Users and Groups on the System
- Creating API Keys for Automation User(s)
You can find more about these steps here on the Knowledge Base: https://support.emcien.com/help/installing-on-a-vmware-virtual-machine
Step 2. Data Feeds
The data feeds are determined by the number of concurrent use-cases, data streams and other customer objectives. Let's assume just one use-case for now, predicting “Machine Failure”, but this process is completely repeatable for as many use-cases as you need.
Analyzing Historical Data to Extract Rules
Here is the high level process to automatically analyze data in a Factory:
Now let’s examine the details. For this example we are making predictions for a use-case of “Machine Failure” (why are my machines failing in the lab and can I predict it in the future?).
Query (SQL): We create a data feed that is simply a SQL query against the source database or data warehouse, exported to either a JSON or CSV data file. Because we want the system to learn each week from recent failures, the query needs to use a moving window of data, expressed as a date range in its WHERE clause. For example, the following query looks at data from 1 year ago until today:
SELECT transaction_id, transaction_date, Machine_Attribute1, …
FROM Machine_Measures_Table
WHERE transaction_date BETWEEN DATEADD(year, -1, GETDATE()) AND GETDATE()
Set this query up to export into one of the EmcienPatterns supported file formats and you are ready to go. The recurring feeds will not need any further human data work going forward, so any manual work to set up the queries is a one-time effort.
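The moving-window export can be sketched in Python as follows. This is a minimal illustration, not an EmcienPatterns tool: the function name `export_moving_window` is hypothetical, SQLite (via the standard `sqlite3` module) stands in for the source database, dates are assumed to be stored as ISO `YYYY-MM-DD` text, and only the columns named in the query above are selected.

```python
import csv
import sqlite3
from datetime import date, timedelta


def export_moving_window(conn, out_path, today=None, window_days=365):
    """Write the last `window_days` of rows to a CSV file; return the row count.

    Hypothetical helper: table and column names follow the example query,
    and dates are assumed to be stored as ISO 'YYYY-MM-DD' strings.
    """
    today = today or date.today()
    start = today - timedelta(days=window_days)
    # The window bounds are passed as parameters, so the same query can run
    # every week without being edited.
    cursor = conn.execute(
        "SELECT transaction_id, transaction_date, Machine_Attribute1 "
        "FROM Machine_Measures_Table "
        "WHERE transaction_date BETWEEN ? AND ? "
        "ORDER BY transaction_date",
        (start.isoformat(), today.isoformat()),
    )
    header = [column[0] for column in cursor.description]
    rows = cursor.fetchall()
    with open(out_path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

Because the window is computed from the run date, each scheduled run picks up the most recent year of data without any manual change to the query.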
Move (SFTP): To run the analysis, EmcienPatterns simply requires that you move the data into the working area of the server. This can be done with a single command-line SFTP command, using the automation user you created in the first step.
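As a rough sketch of the move step, the Python below drives the OpenSSH `sftp` client in batch mode. The host name, user name, key path and remote directory are placeholders rather than EmcienPatterns defaults, and the helper names are hypothetical. It relies only on standard `sftp` behaviour: `-b -` reads batch commands from standard input and `-i` selects the identity file.

```python
import subprocess


def build_sftp_upload(local_file, remote_dir, user, host, key_file):
    """Return (argv, batch_text) for a non-interactive OpenSSH sftp upload.

    All argument values are supplied by the caller; nothing here is specific
    to EmcienPatterns. Paths containing spaces would need extra quoting.
    """
    # Batch commands understood by sftp: upload the file, then disconnect.
    batch_text = f"put {local_file} {remote_dir}/\nbye\n"
    # "-b -" tells sftp to read the batch commands from standard input.
    argv = ["sftp", "-b", "-", "-i", key_file, f"{user}@{host}"]
    return argv, batch_text


def upload(local_file, remote_dir, user, host, key_file):
    """Run the upload; raises CalledProcessError if sftp exits non-zero."""
    argv, batch_text = build_sftp_upload(local_file, remote_dir, user, host, key_file)
    subprocess.run(argv, input=batch_text, text=True, check=True)
```

A call such as `upload("feed.csv", "uploads", "automation", "patterns.example.com", "/home/etl/.ssh/id_ed25519")` would then push the exported file, with every value in that call being an assumed placeholder.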
Step 3. Predictions, Operationalization & Reporting
Recurring data feeds and analysis are only helpful if the results are easily consumable by the team members who need them, in the places they typically work. To that end, EmcienPatterns provides a full set of RESTful APIs that allow all of the metadata, rules and predictions it creates to be exported to other places such as Enterprise Data Warehouses, Workflow Systems and Reporting Systems.
Making Predictions in Batch
Making Batch Predictions follows the same basic steps as doing an Analysis:
Making Real-time Predictions
If we want to make a Real-Time Prediction, it is even easier:
These predictions can be made from any system that has the ability to make HTTP REST calls. This means you can be making predictions within other systems such as CRM, ERP and Supply Chain Management.
For more details see the API link in the upper right corner of the home page within EmcienPatterns.
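To illustrate what "any system that can make HTTP REST calls" looks like in practice, here is a minimal Python sketch using only the standard library. The URL path, the bearer-token header and the JSON payload shape are all assumptions made for illustration; the real endpoints and authentication scheme are the ones listed in the API documentation within EmcienPatterns.

```python
import json
import urllib.request


def build_prediction_request(base_url, api_key, record):
    """Build (but do not send) a JSON POST request for one record.

    The "/predictions" path and the Authorization header format are
    hypothetical placeholders, not the documented EmcienPatterns API.
    """
    body = json.dumps(record).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predictions",
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Keeping request construction separate from sending makes it easy to log or unit-test the call from inside a CRM, ERP or similar system before pointing it at a live server.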
Pushing Results into Dashboards and Workflows
There are many ways to integrate. This overview assumes the most common approach: integrating results back into a Data Warehouse for inclusion in other reports and workflows.
The basic pattern for all integrations is to use the RESTful APIs to pull the results back into a Data Warehouse or directly into a workflow system.
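The load side of that pattern might look like the sketch below, which writes a list of prediction records (as they might look after being parsed from a JSON API response) into a warehouse table. SQLite stands in for the Data Warehouse, and the table name and the fields `transaction_id`, `predicted_outcome` and `probability` are invented for the example; the actual fields depend on what the API returns.

```python
import sqlite3


def load_predictions(conn, predictions):
    """Upsert prediction records into a warehouse table; return the row count.

    Hypothetical schema. INSERT OR REPLACE is SQLite syntax; other
    warehouses typically use MERGE or an equivalent upsert statement.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS machine_failure_predictions ("
        "transaction_id INTEGER PRIMARY KEY, "
        "predicted_outcome TEXT, "
        "probability REAL)"
    )
    # "with conn" commits on success and rolls back if any insert fails.
    with conn:
        conn.executemany(
            "INSERT OR REPLACE INTO machine_failure_predictions "
            "VALUES (:transaction_id, :predicted_outcome, :probability)",
            predictions,
        )
    count = conn.execute("SELECT COUNT(*) FROM machine_failure_predictions")
    return count.fetchone()[0]
```

Keying the table on the record identifier means a rerun of the same batch overwrites earlier rows instead of duplicating them, which matters once the load runs on a schedule.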
Automating the “Data Factory”
All of the steps above represent the different processes that go on within the factory. To automate it we need to have these processes called in order, pulling data in, analyzing it, making predictions and pushing the results out into another system.
There are many ways to do this, but the easiest is CRON. CRON provides the scheduling and script-calling ability needed to create a Repeatable, Automated and Scalable process that we can rely on to provide predictions and insights 24x7.
CRON has a simple command-line interface that lets you define both the schedule of when to run and the scripts to call, such as those mentioned above.
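One way to wire the steps together is a small driver script that CRON calls on a schedule. The sketch below is an assumption-laden example: the crontab line, file paths and step names are placeholders, and `run_pipeline` is a hypothetical helper that runs each step in order and stops at the first failure so that later steps never work on stale or missing data.

```python
import subprocess
import sys

# Hypothetical crontab entry that would run the factory every Monday at 02:00:
#   0 2 * * 1 /usr/bin/python3 /opt/data_factory/run_factory.py >> /var/log/data_factory.log 2>&1


def run_pipeline(steps):
    """Run (name, argv) steps in order and return the names that completed.

    Execution stops at the first step that exits with a non-zero status.
    """
    completed = []
    for name, argv in steps:
        result = subprocess.run(argv)
        if result.returncode != 0:
            print(
                f"step {name!r} failed with exit code {result.returncode}",
                file=sys.stderr,
            )
            break
        completed.append(name)
    return completed
```

In the scheduled script, `steps` would list the export, move, predict and load commands in that order, and the script would exit non-zero when the returned list is shorter than `steps`, so that a failed run is visible in whatever monitoring watches the job.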
If you are not familiar with CRON we suggest you start here: https://en.wikipedia.org/wiki/Cron
From Concept to Production
While this document covers the basic concepts of the Data Factory, Emcien also provides a free public template that can be used by our implementation partners, as well as by any internal customer team that wants a well-thought-out example of how to integrate and automate data analysis and prediction.
Contact our Support Team for more info on how to download a free copy of this reference architecture that describes how to repeatedly roll out many use-case factories within your organization.