
Infobright Blog


Working with big data…

by craigtrombly     Wed, Jun 27, 2012

I recently worked with a few datasets that were too large to manage comfortably on my laptop. While the challenges were similar to those many of you have already experienced, I did identify a few steps that make things a little easier. The most important are process and preparation: the more I knew about what I was going to need to do, the smoother each step went.

Most of the common challenges involve getting the data prepared for an analysis environment or tool. ETL tools are usually used for this, but what if you are preparing data for the first time and are not sure what that process is? Outlining your requirements for each step of the process is key to avoiding common pitfalls.

* What is the data?
* How large is the data?
* Is it compressed?
* What are the storage requirements of the staging environment?
* Which tools will be used for analysis?
* How will the data be prepared?
* How will it be moved around?
* How will it be put into production?
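
The size and staging-storage questions above can be answered before any data is moved. The following is a minimal sketch in Python, not the process used in this post: the function names and the `expansion_factor` of 2.0 (room for the raw file plus a loaded copy) are assumptions for illustration and should be adjusted to your own environment.

```python
import os
import shutil


def staging_space_report(file_size_bytes, free_bytes, expansion_factor=2.0):
    """Return (required_bytes, fits) for a file headed to a staging area.

    expansion_factor is a hypothetical safety margin: 2.0 assumes the raw
    file and one loaded copy must both fit on the same disk.
    """
    required = int(file_size_bytes * expansion_factor)
    return required, required <= free_bytes


def check_staging(path, staging_dir, expansion_factor=2.0):
    """Compare the size of the file at `path` with free space in `staging_dir`."""
    size = os.path.getsize(path)                   # size of the source file in bytes
    free = shutil.disk_usage(staging_dir).free     # free bytes on the staging disk
    return staging_space_report(size, free, expansion_factor)
```

For example, `check_staging("weblog.txt", "/staging")` (both paths are placeholders) returns the number of bytes the sketch expects to need and whether the staging disk has that much free.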

In my scenario, I had a single file of Web log data that was 47GB in size. I knew that with a file this size, no matter what decisions I made or outlined, each stage was going to take some time given the hardware resources I had (the aforementioned laptop). I did not have enough disk space to extract it and load it into a database on my computer, and I would expect many people to hit the same issue when working with files of this size. So the next step was to rar the file, which took about an hour, and SCP it over to a local server in my network. Using an SSH program, I was able to unrar the file, connect to an Infobright command prompt, and start loading the data into a staging table.

It ultimately came to 157 million records. Where I ran into problems was with the line terminators of the file. Because the file came from a Windows machine and I moved it, rarred, onto a Linux box, the line terminators were not what the load expected, but I did not find that out until 30 minutes into the first load data exercise. I made the correction, and 38 minutes later I had a count of 157 million records. Something to note: I did use the reject file parameters for the data load and was actually surprised that it encountered no data errors during the load.
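
A line-terminator mismatch like this can be caught in seconds by sampling the start of the file before kicking off a long load. The sketch below is one way to do that in Python, not what was done for this load: the function names are hypothetical, the comma field separator is an assumption, and the generated statement follows standard MySQL `LOAD DATA INFILE` syntax, so check the Infobright documentation for the exact options your version supports.

```python
def detect_line_terminator(sample: bytes) -> str:
    """Guess the line terminator from a sample of raw bytes."""
    crlf = sample.count(b"\r\n")
    lf = sample.count(b"\n") - crlf    # bare LF (Unix/Linux)
    cr = sample.count(b"\r") - crlf    # bare CR (old Mac style)
    counts = {"\r\n": crlf, "\n": lf, "\r": cr}
    best = max(counts, key=counts.get)
    if counts[best] == 0:
        raise ValueError("no line terminator found in sample")
    return best


def read_sample(path, size=1024 * 1024):
    """Read the first `size` bytes in binary mode so newlines are not translated."""
    with open(path, "rb") as f:
        return f.read(size)


def load_statement(path, table, terminator, field_sep=","):
    """Build a MySQL-style LOAD DATA statement (illustration only).

    Values are interpolated without quoting, so do not pass untrusted input.
    """
    escaped = terminator.replace("\r", "\\r").replace("\n", "\\n")
    return (
        f"LOAD DATA INFILE '{path}' INTO TABLE {table} "
        f"FIELDS TERMINATED BY '{field_sep}' "
        f"LINES TERMINATED BY '{escaped}'"
    )
```

Calling `detect_line_terminator(read_sample(path))` on a file that came from Windows should report `"\r\n"`, which can then be passed to `load_statement` so the load matches the file rather than the platform default.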

Now I was able to use Toad to look at the actual data and sample it. This allowed me to decide which data points were important and which I could optimize. The data needed to be better suited to the analysis environment so it could be queried in a timely fashion, so I made a few DDL changes to the table structure and tested the response time until I was satisfied.

Since this was going to a production machine, I had to dump the data to a file, rar it, send it to the server, and load it again. I do remember thinking to myself that this was quite an exercise. By the end of the day, I had the data in a production database, well optimized. Looking back at what I did to accomplish this taught me the following:

1. Always flowchart each step and document it
2. Be patient, but know as much as you can about the data itself
3. Know your environment(s) thoroughly
4. Be prepared for something to fail, and know how to identify where the breakdown is

The key point I learned is process and preparation. Whether you are working with 47GB of data or 47TB, the point is still valid. Stay tuned and I'll let you know what I am doing with this data...

