Preparing Your Data with Bandit

What is Bandit?

Bandit is a tool that provides automated, high-speed, smart numeric data banding (or binning). Bandit automatically transforms raw numeric data into categorical data for analysis projects and pipelines. Users analyzing numeric data can easily pre-process data using Bandit's built-in formulas or create custom bands. 

Bandit can work across multiple operating systems, including:

  • Red Hat Linux
  • OS X
  • Windows

Bandit is installed inside EmcienPatterns, but it is also available as a command line tool to aid in automation. To download the latest version of Bandit, contact Emcien support.

What kind of input file is required?

The input is a delimited text file in wide format containing the data you want to band. The example in this article uses the diabetes research data set found in our Sample Data Sets.

Web Bandit is accessible from the menu in the top right of the screen, or at the bottom right of the 'upload data' screen.

From the landing page, drag and drop or select a file that you would like to band. The following dialog box will appear:

Drop-Down Box A allows you to specify the variable you would like to predict. Choosing 'Select one (optional)' will band the file with no dependent outcome.

Toggling Box B allows the default settings to be overridden. Within this toggle you can request that a validation file be generated, change the delimiter of the file (the default is comma), choose the compression type for the output files, and adjust the input file type (either wide or long).

Toggling Box C allows you to apply user-defined bands to a file. See the section 'Setting User-Defined Bands on a File' below for more information on how to define user bands.

Upon a successful run, you can download the banded file and breaks file in .gzip format. Because EmcienPatterns accepts .gzip files, you can upload the downloaded files directly to EmcienPatterns without decompressing them.
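If you want to check a downloaded file before uploading it, you can read it without decompressing it to disk. The following is a minimal Python sketch using only the standard library. The function name is hypothetical, and it assumes the download is a gzip-compressed delimited text file that uses the same delimiter as the input.

```python
import csv
import gzip


def preview_gzip_csv(path, n_rows=5, delimiter=","):
    """Return the first n_rows rows of a gzip-compressed delimited file."""
    # "rt" opens the compressed stream in text mode so csv can read it directly.
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        # zip stops after n_rows rows without reading the rest of the file.
        return [row for _, row in zip(range(n_rows), reader)]
```

Pass the path of the downloaded banded or breaks file to see its first few rows.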

Setting User-Defined Bands on a File

Oftentimes you will want to control some aspect of the banding yourself. This includes situations where a banded file has:

  • columns you would like to exclude
  • columns that were banded that should be treated as categorical
  • columns that you want to override with another system-defined band, your own custom bands, and/or strings

Follow these simple instructions to create a user bands file. Open up the breaks_diabetes_input_data_file.csv file that was just created. Sample:

Column A:  Categories within your data
Column B:  Method of banding used
Column C:  How the data in the column was transformed
Columns D-E:  Band range (minimum and maximum)
Column F:  Number of rows in the data that contain that band
Column G:  Label used to rename the band

Options to create user bands file

Always begin by editing the original breaks file (breaks_diabetes_input_data_file.csv) from a previous Bandit run on the desired dataset.

  1. To apply a system-defined band:
    • For each category that should use a system-defined band, delete all but one line.
    • In column B of the category being changed, type one of the system-defined banding options and delete the remaining content of that row.
    • Save this file as something other than breaks_diabetes_input_data_file.csv.
    • Run Bandit again with your new user-defined bands file as the -b bands file.
      • Example:  bandit -t diabetes_input_data_file.csv -d "Diabetes" -b user_input_file.csv
    • Bandit will apply the specified banding method to the dataset. In this case, that is the freq banding method with a limit of three bands.
  2. To create a user-defined band:
    • Replace the banding method in column B with user.
    • Type the new minimum and maximum in columns D and E respectively.
  3. To apply labels to bands:
    • Replace the banding method in column B with user.
    • In column G, type the number or text string you want to represent that band.

Note: Always save the modified breaks file (user bands file) under a different name. This is important because if the breaks file name is not changed, the file will be overwritten by the next Bandit run.
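The manual edits described above can also be scripted. The following Python sketch is one possible approach, not part of Bandit itself. It assumes the breaks file is a comma-delimited file whose columns appear in the A to G order listed earlier. The function names are hypothetical, and so is the choice to identify a band by its current minimum value.

```python
import csv

# Zero-based positions of the breaks-file columns described in this article:
# A = category, B = banding method, D/E = band range, G = band label.
# This layout is an assumption; check it against your own breaks file.
COL_CATEGORY, COL_METHOD, COL_MIN, COL_MAX, COL_LABEL = 0, 1, 3, 4, 6
N_COLUMNS = 7


def set_user_band(rows, category, old_min, new_min, new_max, label=None):
    """Return a copy of rows with one band of `category` turned into a user band."""
    updated = []
    for row in rows:
        # Copy the row and pad short rows so column G always exists.
        row = list(row) + [""] * (N_COLUMNS - len(row))
        if row[COL_CATEGORY] == category and row[COL_MIN] == old_min:
            row[COL_METHOD] = "user"
            row[COL_MIN], row[COL_MAX] = new_min, new_max
            if label is not None:
                row[COL_LABEL] = label
        updated.append(row)
    return updated


def rewrite_breaks_file(src_path, dst_path, **change):
    """Read a breaks file, apply one change, and save it under a different name."""
    if src_path == dst_path:
        raise ValueError("save the user bands file under a different name")
    with open(src_path, newline="") as src:
        rows = list(csv.reader(src))
    with open(dst_path, "w", newline="") as dst:
        csv.writer(dst).writerows(set_user_band(rows, **change))
```

The explicit check on the file names mirrors the note above: writing to a different path keeps the original breaks file intact.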

Sample user bands file:

Here is an example of Bandit command passing in a user-bands file:

bandit -t diabetes_input_data_file.csv -d "Diabetes" -b user_input_file.csv

The above example will:

  1. Ingest the input file (-t diabetes_input_data_file.csv)
  2. Use the category "Diabetes" (-d "Diabetes") as the dependent variable.
  3. Use the user bands file (-b user_input_file.csv) to specify how the data should be transformed.
  4. Return the following output files:
    • Banded file (banded_diabetes_input_data_file.csv)
      • User banded file optimized for analysis
    • Breaks file (breaks_diabetes_input_data_file.csv)
      • Contains the bands and methods used during user banding process.

You must unzip the package and save the Bandit executable file to a location on your computer. You can rename the file to "Bandit" to simplify use of the processing commands. Renaming is not required, but the correct file name is required in the processing commands.

 -h   help:   describe the options and version 
 -d CHAR   dependent:   name of the dependent category 
 -i CHAR   input:   form of input file (W for wide or R for Long) 
 -f CHAR   form:   form of output file (W for wide or R for Long) 
 -s CHAR   separator:   Separator character 
 -t CHAR   table:   name of table file 
 -b CHAR   bands:   name of bands file 
 -g INT   guess:   1 for normal; 2 for clever 
 -p INT   prune:   0 for none; 1 for pruning 
 -r INT   randomize:   1 for 10pct test; 2 for 20pct test 
 -v INT   verbosity:   Print out lots of extra information on stderr 

There are 9 transformations you can feed to a breaks file when banding the data. They are:

  1. freq - Similar frequency bands - five bands are chosen to divide the frequency of transactions into similar sizes. This command can include the number of bands you wish to have (e.g. “freq8” would create 8 bands based on frequency).
  2. wide - Equal width bands - the range of numerical values is divided evenly into five bands. This command can include the number of bands you wish to have (e.g. “wide7” would create 7 equal width bands).
  3. fuzz - Fuzzy bands - organizes the data into five bands with sizes of approximately 10%, 20%, 40%, 20%, 10% of the transactions. The algorithm begins by placing the highest occurring value into the middle 40% and builds left and right of that to create the other bands.
  4. stat - Builds bands based on the standard deviation
  5. half - Builds bands based on half of the standard deviation
  6. info - Information Banding - based on "information theory", recursively breaks down the data into chunks with the goal to reduce the entropy/disorder (basically trying to cleanly organize bands such that predictability is maximized). This banding type REQUIRES the dependent parameter (“-d”) to be passed in the command line.
  7. user - User defined bands allow for custom banding of the numeric ranges and assigning them names. “User” type bands REQUIRE the use of a “bands.csv” input file, passed with the -b option. See the section 'Setting User-Defined Bands on a File' for more information.
  8. cat - Categorical Data command tells Bandit to leave the data alone and pass it onto the output file as is.
  9. skip - Skips the column of data and does not include it in the output file at all.
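To make the difference between the first two methods concrete, here is a minimal Python sketch of equal-width and equal-frequency banding. It illustrates the general technique only and is not Bandit's implementation. The function names, the handling of values that fall exactly on a band edge, and the sample values are all assumptions made for illustration.

```python
def equal_width_edges(values, n_bands=5):
    """Split the range [min, max] into n_bands intervals of equal width."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / n_bands
    return [lo + i * step for i in range(n_bands + 1)]


def equal_frequency_edges(values, n_bands=5):
    """Pick edges so that each band holds roughly the same number of rows."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[0]]
    for i in range(1, n_bands):
        edges.append(ordered[(i * n) // n_bands])
    edges.append(ordered[-1])
    return edges


def assign_band(value, edges):
    """Return the 0-based index of the band that contains value."""
    last_band = len(edges) - 2
    for i in range(last_band):
        if value < edges[i + 1]:
            return i
    return last_band  # the top band includes the maximum value


# One large outlier shows how differently the two methods behave.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(equal_width_edges(values, 3))      # [1.0, 34.0, 67.0, 100.0]
print(equal_frequency_edges(values, 5))  # [1, 3, 5, 7, 9, 100]
```

With equal widths, nine of the ten values land in the first band because the outlier stretches the range. With equal frequencies, each band holds two values. Repeated values can produce duplicate edges in this sketch, which it does not attempt to handle.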

Processing data using Bandit involves creating the input files, then passing them along into Bandit using a command line parameter. Here is an example using the diabetes data set:

bandit -t diabetes_input_data_file.csv

The above example will:

  1. Ingest the input file (-t diabetes_input_data_file.csv)
  2. Return a banded file (banded_diabetes_input_data_file.csv) optimized for analysis and a breaks file (breaks_diabetes_input_data_file.csv) containing bands and methods used.

When using Bandit to process data that contains a defined outcome (dependent) category, you will use the -d parameter to pass in the dependent category name. Here is an example:

bandit -t diabetes_input_data_file.csv -d "Diabetes"

The above example will:

  1. Ingest the input file (-t diabetes_input_data_file.csv)
  2. Use the category "Diabetes" as the dependent (-d "Diabetes") variable.
  3. Return the following output files:
    • Banded File (banded_diabetes_input_data_file.csv)
      • Auto-banded file optimized for analysis.
    • Breaks File (breaks_diabetes_input_data_file.csv)
      • Contains the bands and methods used during the auto banding process.
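Because Bandit runs from the command line, the commands shown in this article can be driven from a script. The following Python sketch builds the argument list from the -t, -d, and -b options documented above. The function names are hypothetical, the sketch assumes the executable is on your PATH under the name bandit, and it assumes Bandit signals failure with a non-zero exit status, which this article does not confirm.

```python
import subprocess


def bandit_command(table, dependent=None, bands=None, executable="bandit"):
    """Build the argument list for a Bandit run using options from this article."""
    cmd = [executable, "-t", table]
    if dependent:
        cmd += ["-d", dependent]  # name of the dependent category
    if bands:
        cmd += ["-b", bands]      # user bands file
    return cmd


def run_bandit(table, **options):
    """Run Bandit; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(bandit_command(table, **options), check=True)
```

For example, bandit_command("diabetes_input_data_file.csv", dependent="Diabetes") produces the same arguments as the last command above. Passing a list rather than a single string avoids shell quoting problems with category names that contain spaces.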