EmcienPatterns is a predictive analytics engine designed to operate either stand-alone or embedded. This document provides performance metrics for Emcien. It covers general scaling mechanisms as well as detailed performance results for a variety of datasets that demonstrate many use-cases.
Emcien baseline tests are performed on a mixture of both public and private datasets. This enables Emcien to provide a broader spectrum of benchmarks. In cases where the data set is publicly available, reference links are provided.
The benchmark includes fixed and variable length data sets as described below:
Standard Row/Column Data
Also called fixed length, typically used for analytics benchmarking.
Variable Length Transactional Data
e.g. point-of-sale data, supply chain data, etc. Emcien can analyze this data in its native format.
Each data set is characterized by the following metrics, which describe the size and shape of the dataset as well as the performance and scaling of the engine.
|Rows of Data||Total number of rows of data analyzed (no sampling)|
|Columns of Data||Number of columns/features in the dataset|
|Analysis Time: Speed of Rules Generation||Analysis time for software to extract all of the If/Then rules that represent the ‘predictive model’ of the full dataset|
|Prediction Time: Speed to Predict New Data||Time it takes to predict a batch of new transactional data using only the rules. The data for predictions can be batch or streaming.|
|Resources: Peak Memory Usage||RAM required to build the graph in the analysis engine|
|Resources: Number of Rules||Number of If/Then Rules generated for high-accuracy predictions. The software tests the accuracy of the rules on a hold-out sample in the analysis phase.|
|Resources: Size of the rules data set (MB)||Size of the generated rules file. In addition to showing the data compression ratio, this indicates the ease of distributing rules files to remote IOT devices.|
Emcien–Designed to Scale
EmcienPatterns comprises two engines: the Analysis Engine, which extracts the If/Then rules that make up the ‘model’, and the runtime Prediction Engine, which uses the If/Then rules to make outcome decisions and provide supporting ‘reasons’ for each prediction.
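As a rough illustration of how a set of If/Then rules can yield both an outcome and its supporting reasons, consider the minimal sketch below. The `Rule` shape, the `predict` function, the column names and the confidence values are all hypothetical; the actual EmcienPatterns rule format and matching logic are not described in this document.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    # Hypothetical rule shape: IF every condition matches THEN outcome.
    conditions: dict   # column name -> required value
    outcome: str
    confidence: float  # assumed hold-out accuracy of the rule, 0..1


def predict(record, rules):
    """Return (outcome, reasons) from the most confident matching rule."""
    matches = [
        r for r in rules
        if all(record.get(col) == val for col, val in r.conditions.items())
    ]
    if not matches:
        return None, []
    best = max(matches, key=lambda r: r.confidence)
    # The matched conditions double as the human-readable reasons.
    reasons = [f"{col} = {val}" for col, val in best.conditions.items()]
    return best.outcome, reasons


# Illustrative rules and record (made-up values, not benchmark data).
rules = [
    Rule({"engine_temp": "high", "oil_pressure": "low"}, "failure", 0.92),
    Rule({"engine_temp": "normal"}, "ok", 0.85),
]
record = {"engine_temp": "high", "oil_pressure": "low", "mileage": "80k+"}
outcome, reasons = predict(record, rules)
print(outcome, reasons)
```

Because prediction in this sketch only reads the rules, the same rule set can be applied to batch or streaming records without access to the original training data.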
Analysis Engine Scaling
The Analysis Engine uses symmetric multiprocessing (SMP) to analyze large quantities of data. This method involves increasing the number of processors/cores on a physical (or virtual) machine so that the software can run parallel computations during the knowledge graph creation and exploration steps. Because a Knowledge Graph is far more compact than the raw rows/cols data, analyzing a graph in parallel is extremely efficient. Adding processors provides a near-linear speed increase to the process.
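One general way to reason about "near-linear" scaling is Amdahl's law, which relates speedup to the fraction of work that can run in parallel. The sketch below is a generic model, not an Emcien measurement; the parallel fraction of 0.95 is an assumed value chosen only for illustration.

```python
def amdahl_speedup(cores: int, parallel_fraction: float) -> float:
    """Theoretical speedup on `cores` processors under Amdahl's law."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


# Assumed parallel fraction (hypothetical, not taken from the benchmark).
ASSUMED_PARALLEL_FRACTION = 0.95

for cores in (1, 2, 4, 8):
    speedup = amdahl_speedup(cores, ASSUMED_PARALLEL_FRACTION)
    print(f"{cores} cores -> {speedup:.2f}x")
```

The closer the parallel fraction is to 1, the closer the speedup stays to the number of cores, which is what a near-linear speed increase implies.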
Prediction Engine Scaling
The Prediction Engine is very fast, as it is only responsible for applying the pre-computed If/Then rules from the Analysis Engine to new incoming data. The Prediction Engine supports horizontal scaling across many machines, so it can run in parallel on as many servers or IOT edge devices as needed. Because the predictive rules are very small, both the rules and the runtime Prediction Engine can be pushed to every endpoint, allowing completely independent predictions in parallel.
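Since each endpoint predicts independently, capacity planning reduces to simple arithmetic. The sketch below uses the average rate of 8,500 predictions per second reported in the results section of this document; it assumes every endpoint sustains that rate and that there is no coordination overhead, which may not hold on smaller edge devices. The function names are illustrative.

```python
import math

# Average per-node prediction rate reported in this document's results.
PER_NODE_RATE = 8_500  # predictions per second


def nodes_needed(target_rate: float, per_node_rate: float = PER_NODE_RATE) -> int:
    """Smallest number of independent endpoints that reach target_rate."""
    return max(1, math.ceil(target_rate / per_node_rate))


def batch_seconds(records: int, nodes: int,
                  per_node_rate: float = PER_NODE_RATE) -> float:
    """Time to predict a batch split evenly across independent endpoints."""
    return records / (nodes * per_node_rate)


# Hypothetical workload: 100,000 predictions per second.
nodes = nodes_needed(100_000)
print(nodes, batch_seconds(1_020_000, nodes))
```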
Other Unique Scaling Abilities
The key to solving any big problem is to first break it down into base components. Unlike other solutions that try to brute force the analysis of large data sets, such as running queries against huge database tables, EmcienPatterns employs a few unique capabilities that allow it to take big problems and analyze them in a much more condensed form.
- Automatic Feature Selection: Without the need for user intervention, Emcien identifies and removes columns and values that have little to do with the issue being analyzed. This reduces the complexity and scope of the problem, while additionally removing noise that would otherwise detract from a clear understanding of significant factors.
- Automatic Data Binning: Emcien can transform numeric data into distinct ranges that have optimal predictive potential. Rather than needing to contextualize individual values, numbers are automatically grouped together into bands that produce consistent results. This feature can transform large ranges of low-frequency values into compact and significant patterns.
- Graph Analysis: Emcien converts all source data into a graph (nodes with arcs connecting to other nodes) of related items before trying to discover patterns. Instead of analyzing millions of transactions in a database, data is first connected together on a graph where valuable relationships are instantly discovered by the engine.
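To make the binning and graph ideas above concrete, the following sketch groups numeric readings into equal-frequency bands and then counts how often pairs of items appear together, which is one simple way to represent data as a weighted graph. Both techniques are generic stand-ins: this document does not describe the algorithms EmcienPatterns actually uses, and the function names and sample values are hypothetical.

```python
from collections import Counter
from itertools import combinations


def equal_frequency_bins(values, n_bins):
    """Return cut points that split values into n_bins groups of similar size."""
    ordered = sorted(values)
    # Cut points sit at the 1/n, 2/n, ... positions of the sorted values.
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]


def bin_label(value, cuts):
    """Map a number to the index of its band (0 is the lowest band)."""
    return sum(value >= c for c in cuts)


def build_graph(rows):
    """Count co-occurrences of 'column=value' items within each row.

    Nodes are items; each edge weight is the number of rows containing both.
    """
    edges = Counter()
    for row in rows:
        items = sorted(f"{col}={val}" for col, val in row.items())
        edges.update(combinations(items, 2))
    return edges


# Made-up temperature readings; 'failed' is an illustrative outcome column.
temps = [61, 64, 70, 72, 88, 91, 95, 99]
cuts = equal_frequency_bins(temps, 2)
rows = [{"temp_band": bin_label(t, cuts), "failed": t > 85} for t in temps]
graph = build_graph(rows)

for edge, weight in graph.most_common():
    print(edge, weight)
```

After binning, eight distinct readings collapse into two bands, so the graph has only a handful of nodes and edges regardless of how many rows feed it.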
Hardware Configuration for Benchmark
This paper provides quantitative benchmarks for multiple data sets. To enable the reader to duplicate and/or compare results, all the benchmarks are performed on a standard configuration that has been published.
Benchmark Machine Profile
|Clock Speed||2.4 GHz|
|Processor Type||Intel(R) Xeon(R) CPU E5-2609 0 @ 2.40GHz|
|Total Memory||128 GB|
|Operating System||Ubuntu 14.04.5 LTS|
Some of the datasets included in this benchmark are publicly available, so results can easily be compared with alternative methods and technologies. We have also included non-public datasets in this evaluation to provide a wider range of data types, covering many use-cases and problems our customers have solved using the EmcienPatterns solution.
The use-cases these datasets cover include:
- DDOS: Network Traffic “denial of service” data set
- Predictive Maintenance: Distribution Vehicles that are monitored for breakdowns to predict failures before they occur
- Automotive IOT: Monitoring car sensors to determine what drives peak efficiency
- Credit Spend Data: Analyzing credit card spend data for high-spend patterns
- Airline Traffic: Data tracking the locations (lat/lng), altitudes and other metrics of commercial flights all over the world to predict ascent/descent rates (public)
- Website Traffic: Discover patterns of activity related to different days of the week
- DNS: Predicting record types based on other DNS data collected
- Factory Sensors: Predicting failure of factory devices based on combinations of hundreds of sensors sending measures back each minute.
- Retail Sales: Point of Sale data used to discover product affinities and product bundles
Speed & Scale Results
This table captures the key metrics involving the size of the data, speed of the computations and scale of the resources utilized during the analysis. Further details of the datasets and use-cases are described later in this document.
- 8,500 Predictions per Second: The average prediction rate is over 8,500 transactions per second.
- 22 Minutes Average Analysis Time: The average time required to generate rules for 9 different use-cases and datasets was 22 minutes on an 8-core server.
- 99.8% Source-to-Rule Compression: All datasets had a source-data-to-rules size compression of 99.8% or greater.
- Longest Time: The Predictive Maintenance use-case needed 1 hour and 20 minutes to produce rules for 15 different outcomes.
- Peak RAM: The one case of high RAM usage was the Website Traffic dataset, which included large numbers of long website URL strings and other metadata.
- Retail Sales Objective: This analysis produced Affinities and Bundles/Kits to recommend instead of Rules to predict. Prediction speed was not measured as recommendations are precomputed and provided instantaneously on web stores.
- Retail Sales Structure: This data was in POS format (variable length transactions), not the traditional rows/cols “wide” format needed for most other analysis systems. Analyzing this type of data is difficult for most other systems.
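The compression figure above can be read as follows: a 99.8% source-to-rule compression means the rules file is at most 0.2% of the size of the source data. The sketch below shows the arithmetic; the 1 GB source size is a hypothetical example, not one of the benchmark datasets, and the function names are illustrative.

```python
def compression_pct(source_bytes: int, rules_bytes: int) -> float:
    """Percentage by which the rules file is smaller than the source data."""
    if source_bytes <= 0:
        raise ValueError("source_bytes must be positive")
    return 100.0 * (1.0 - rules_bytes / source_bytes)


def max_rules_bytes(source_bytes: int, compression: float = 99.8) -> float:
    """Largest rules file size consistent with the given compression."""
    return source_bytes * (1.0 - compression / 100.0)


# Hypothetical 1 GB source file (not a benchmark dataset).
source = 1_000_000_000
print(max_rules_bytes(source))              # upper bound on rules size in bytes
print(compression_pct(source, 2_000_000))   # compression for a 2 MB rules file
```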
More Dataset Details
“DDOS” - Distributed Denial of Service
This dataset contains network traffic data involving Distributed Denial of Service attacks on a network. The objective is to classify the 5 different types of attacks using metrics such as packet rate, number of packets, utilization and timings. This data is public and available here.
Predictive Maintenance
This dataset contains transportation details about hundreds of distribution vehicles, including sensor data from real-time use of the vehicles and the downtimes experienced. The objective is to discover both the systemic and individual usage patterns involved in vehicle downtime, predicting both what will fail and when it will fail.
Automotive IOT
This data captures vehicle specifications combined with real-time sensor data over time. The objective is to understand what drives the efficiency of the vehicles for the purpose of suggesting optimal conditions.
Credit Card Spend Data
This data captures the credit card purchases by customers at different stores. The objective is to discover what drives customers to visit stores repeatedly and drive higher spend for the purpose of providing incentives.
Airline Traffic
This data has commercial airline information for planes around the world, including speed, altitude, and whether they were taking off, cruising, descending or landing. One of the many objectives is to predict the vertical ascent or descent rate based on flight numbers, routes, types of aircraft, etc. Data similar to this is available online from providers such as: http://www.airnavsystems.com
Website Traffic
This dataset includes website actions on a high-volume public website with many thousands of users. Articles and pages are updated constantly, and products are bought by customers. The objective is to discover the patterns of user activity based on the day of the week. Customer segmentation by day of week can allow for optimal planning of content, products and updates to the site.
DNS
This dataset includes DNS resolution records, including source and destination IPs, hostnames, etc. The objective is to discover patterns of activity as they relate to DNS record types. The goals include finding nefarious activity as well as usage patterns for the network team to consider.
Factory Sensors
This dataset includes over 500 sensors sending data every minute, measuring speeds, temperatures, voltages, etc. The objective is to discover what combinations of sensors can predict the failures that have recently occurred.
Retail Sales
This is Point of Sale data that has variable length customer transactions (receipts). Almost 40,000 product SKUs are captured in the data. The objective is to extract product-level affinities for cross-selling and to generate product bundles/kits. Using the affinities data within the online webstore, real-time recommendations can increase customer revenue.
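A common way to quantify product affinities in variable length receipt data is lift, a standard association metric: values above 1 indicate that two products appear together more often than chance would predict. The sketch below is a generic illustration under that assumption; this document does not state which affinity measure EmcienPatterns computes, and the receipts shown are made up.

```python
from collections import Counter
from itertools import combinations


def pair_lift(receipts):
    """Return {(a, b): lift} for every product pair that co-occurs at least once."""
    n = len(receipts)
    item_counts = Counter()
    pair_counts = Counter()
    for receipt in receipts:
        items = sorted(set(receipt))  # ignore repeated SKUs within a receipt
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    # lift = P(a and b) / (P(a) * P(b))
    return {
        (a, b): (count / n) / ((item_counts[a] / n) * (item_counts[b] / n))
        for (a, b), count in pair_counts.items()
    }


# Illustrative receipts of different lengths (not benchmark data).
receipts = [
    ["bread", "butter"],
    ["bread", "butter", "jam"],
    ["milk"],
    ["bread", "milk"],
]
lift = pair_lift(receipts)
for pair, value in sorted(lift.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Note that the receipts are consumed as lists of differing lengths, with no need to pivot them into a wide rows/cols table first.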