Joinutility seperatorLogin utility separator Infobright.com

Infobright Blog

09
Nov

Loading Data in Infobright on Amazon EC2 using DLP

Jeff Kibler's photo
by Jeff Kibler     Wed, Nov 09, 2011

Many of our (current and new) community members and customers ask about using Infobright on Amazon’s EC2 cloud environment.  A common question we receive is about the ability to use the cluster to distribute the load process into our Enterprise Edition.  For those using the trial version of Infobright Enterprise or if you are an IEE customer, here are some steps to set up and use the Distributed Load Processor (DLP) on EC2. (If you don't have a license for DLP you can download an eval version). I assume you already have an account on EC2, and I assume you’re comfortable with setting up your security groups.  The steps also assume you have two distinct instances set up on the cloud.

If you’d like to discuss Infobright and EC2, I invite you to post your questions on this Infobright Forum thread: http://bit.ly/ec2_infobright_forum

The Steps:

Definitions:

DLP EC2 Instance – The Amazon Server which will be hosting your distributed load processor binaries.

DB EC2 Instance – The Amazon Server which will be hosting your Infobright Enterprise Database.

Set up Permissions

  1. ON EC2 SECURITY GROUP: If you plan to use an FTP server to load binaries onto EC2, add your FTP Server to the Security Group on Ports 20-21
    1. Ex:  upload.sourceforge.net : 216.34.181.56/32 – Ports 20-21
  2. ON EC2 SECURITY GROUP: Ensure Inbound permissions on your Database EC2 instance for all DLP EC2 instances
    1. Ex: “Custom TCP Rule”, Port: 5029, Source: INTERNAL_IP_ADDRESS/32
  3. ON EC2 SECURITY GROUP:  Add inbound permissions for SSH into the instance from your laptop/home network
    1. Ex: “SSH” – YOUR_IP_ADDRESS/32
  4. ON EACH DLP EC2 INSTANCE: Use ssh-keygen –t rsa to generate private/public keys.  Add the DLP EC2 Instance public key to the authorized_keys on the Database EC2 instance’s ~/.ssh/authorized_key file
    1. On DLP Instance: Open ~/.ssh/id_rsa.pub with a text editor
    2. On DB Instance: Open ~/.ssh/authorized_keys with a text editor
  5. Copy and paste the DLP Instance key string into the DB Instance authorized_key file on a new line

Install the Database and DLP:

  1. Install Infobright DB, DLP, and keys/licenses if necessary on the appropriate servers.
    1. DB Server: license file goes into /usr/local/Infobright
    2. DLP Server: license file is listed in the command line
  2. Create the database and tables on the database
    1. CREATE DATABASE foo;
    2. CREATE TABLE bar (cola int, colb int)
  3. For the Infobright DB, add access permissions for the appropriate user from the IP address of the DLP EC2 Instance
    1. CREATE USER ‘BLAH’@’IP ADDRESS’ ;
    2. GRANT SELECT, INSERT, FILE ON *.* TO ‘BLAH’@’IP ADDRESS’;
  4. Create dummy data on DLP Server.
    1. > echo “1,2” >> ~/2col_data.csv

Load the data using DLP

  1. Ex:  > alias dlp-mydatabase=’ dataprocessor -E /opt/infobright/tools/distributed-load-processor/iblicense-dlp-eval.dat -v -H THE_IP_ADDRESS -D foo -T bar -f txt_variable --fields-terminated-by "," -L root -X -l /tmp/dlp.log -I file –I’

 By doing so, you can load data with either of the following commands:

> dataprocessor -E /opt/infobright/tools/distributed-load-processor/iblicense-dlp-eval.dat -v -H THE_IP_ADDRESS -D foo -T bar -f txt_variable --fields-terminated-by ',' -L root -X -l /tmp/dlp.log -I file -i ~/2col_data.csv

OR

> dlp-mydatabase ~/2col_data.csv

  1. –E tells you where the license key is
  2. –v tells you to be verbose in your output. As you get use to DLP, I recommend you keep this option ON.
  3. –H is the host ip
  4. –D is the database
  5. –T is the able
  6. –f is the format type
  7. –fields-terminated-by you should know
  8. –L is the user
  9. –X is execute immediately (as opposed to storing to a file, manually transferring, and manually loading)
  10. –l is the log file location
  11. –i is the input file

Test your data on the Database Instance


Previous Page