Joinutility seperatorLogin utility seperator Infobright.com
   
 
Creating database and tablespaces
Posted: 18 November 2008 01:25 PM   Ignore ]  
Newbie
Rank
Total Posts:  3
Joined  2008-11-18

Hi
Is it possible to create table spaces and data files in infobright?
How are really big db managed where TBs of data is stored?
Is everything stored in one big tablespace in the data directory?

Or when creating a db in infobright are such matters not considered?
I’m a little bit puzzled about setting up a big database in terms of making the data storage files tablespaces
and even partitioning the data on the disk?

Thanks
J

Profile
 
Posted: 18 November 2008 04:13 PM   Ignore ]   [ # 1 ]  
Newbie
Rank
Total Posts:  34
Joined  2008-08-18

Hi James

>Or when creating a db in Infobright are such matters not considered?
It is not considered.

We have option “ClusterSize (in megabytes)” can be set in brighthouse.ini file. The cluster size is the maximum size of each data file to be created to store compressed data. If the size exceeds, it creates new one. The defaults cluster size is 2GB. If compression ratio is 10:1, then for every 20GB data, you will see a new data file.

The data files are stored inside folder tablename.bht/ in database data folder. brighthouse.ini can be found in the data folder too.

regards

Signature 

Mahib
Software Developer
Infobright

Profile
 
Posted: 18 November 2008 06:31 PM   Ignore ]   [ # 2 ]  
Newbie
Rank
Total Posts:  34
Joined  2008-08-18

To give you more insight about data files, I should also mention that Infobright is a column oriented data warehouse. Each column data goes to a different data file. Therefore, being a columner and using ClusterSize parameter, we do horizontal and vertical partitioning of a table.

How large is your table? We would like hear your experience with our server.

Signature 

Mahib
Software Developer
Infobright

Profile
 
Posted: 18 November 2008 06:54 PM   Ignore ]   [ # 3 ]  
Jr. Member
Avatar
RankRank
Total Posts:  85
Joined  2008-08-18

Hi James,

I’ll try and throw my $0.02 into this thread.

Basically, Infobright relies on the operating system for the physical management of data on the server.  As Mahib mentioned, you can control the file size by setting the clustersize variable.  However, most leave it at 2 GB, which is the default for Linux.

When it stores the compressed data, IB partitions by column and then by size.  So, if you have a column that gets 10:1 compression, then approximately 20 GB of data will be stored for that column in the first file.  Once it’s exceeded, it will start another file.  You will find that, for large amounts of data, some columns will be stored in eight or nine files, and others in only one or two.  This is because the compression varies by column - some have extremely high compression, while others may be much lower.

I’m not sure if this has answered your question.  We’ve tried to design the solution so that you don’t have to worry about table spaces, by utilizing system files and by our high compression rates (a 1 TB file typically compresses to 100 GB, which is much easier to manage).

I’m looking forward to your further thoughts and comments

Signature 

John Kemp
Infobright

Profile
 
Posted: 19 November 2008 10:15 AM   Ignore ]   [ # 4 ]  
Newbie
Rank
Total Posts:  11
Joined  2008-09-17

Data Warehousing as an Enterprise investment, performance is only one of most important points, such as as managability, functionality.
In my opinion, it will take the ICE team much more efforts to make ICE a reasonable choice, such as:
1. Insert/update/delete statement support.
2. Custom storage layout mechanism.
3. Full level backup/restore mechanism.
4. Master/Stand-by server/process articheture to allow hardware or software failures.
5. Data replication mechanism.
6. Scale out support.

Profile
 
Posted: 20 November 2008 04:55 PM   Ignore ]   [ # 5 ]  
Jr. Member
Avatar
RankRank
Total Posts:  85
Joined  2008-08-18

Hi Amber,

Thank you for providing your thoughts on our solution.  I’ll try to address your comments here, and would be more than happy to have a further discussion with you if you would like.

Infobright took the approach of focusing on delivering a high performance, high compression data store that delivers cost effective analytics on high volumes of data. 

Now that I’ve got the marketing speak out of the way, let me give you my thoughts on our approach and why I believe that we do support the functionality that you have requested.

First off, full DML support is provided in the IEE version of the product.  However, using ETL processes, you can eliminate the need for IUD by undertaking your processing outside of ICE, and then loading the transformed data.

We intentionally built our solution to work within the MySQL environment (we are sort of an engine within MySQL; however, we do have a few differences such as the data loader).  This gives us the ability to leverage many of the tools available in the MySQL environment, both from other MySQL partners and MySQL themselves.

And there’s a certain advantage to this: I’ve often seen ‘built in’ functions like backups and monitoring included in offerings that are subsequently never used, since organizations often have another offering already in place.  Case in point:  most ETL tools now have some pretty good monitoring tools.  But in reality, they are rarely used in practice since you are already using tool ‘x’ and don’t need it.

So, for things like backup and restore, you can utilize system tools like some of the Linux backup tools.  Or you can use a product like Zmanda (open source).

And for things like data replication and master/slave failover, there are tools like the MySQL proxy server that can support replication, manage heartbeats and failover, and can even ‘police’ queries such that blatantly slow queries (select * against a billion row table, for example) can be blocked.  All of these tools are open source and are readily available, with strong community support.

The biggest thing to consider when rolling out an ‘enterprise’ data warehouse is what is really required? 

Backups have to be done. Restoring data due to a hardware failure is something that I have rarely had to do.  However, restoring data due to human error is far more common (the “Oops, I shouldn’t have loaded that” scenario).  And none of the replication software or redundant hardware you can put in place will prevent that.

One of the first things I do is really nail down what the service requirements are.  For a DW that’s being used by many people on a 7X24 basis, you do need failover, since loss of the DW likely leads to an immediate cost to the business.  However, for a solution being used by a small department, you may not need to have 7X24 availability.  In fact, downtime of even a business day may be acceptable!  In this case, I would eliminate the redundancy since Infobright runs on ‘off the shelf’ Intel hardware that can be easily acquired – and restoration of the DW can be undertaken quickly (for example, Infobright’s typical compression of a 1 TB database to 100 GB makes the backups and restores quick!).

I’ll come back to a couple of points you raised that I didn’t address earlier (sorry for jumping around):

We don’t provide a custom storage layout function because, quite frankly, we don’t think there is a need for it and it runs counter to our philosophy of a simple data warehouse that can be easily managed.  And since, through our compression and knowledge grid, we typically store data in 1/10 of the size of other organizations, the need for determining where to physically put tables is negated.

Regarding scale out support, our solution is designed to ‘ride the hardware curve’, so to speak.  We scale out on concurrent queries via the number of cores in a solution.  So, if you have a single dual core cpu, you will be limited to two concurrent queries.  However, scaling out the cores to eight will allow you to run eight concurrent queries without any impact on performance.  We are continuing to work on our concurrent query performance to scale beyond eight concurrent queries without material performance degradation.

I trust that this answers your questions.  I would be happy to engage in a further discussion with you at your convenience.

Regards,

Signature 

John Kemp
Infobright

Profile
 
Posted: 22 November 2008 03:39 AM   Ignore ]   [ # 6 ]  
Newbie
Rank
Total Posts:  11
Joined  2008-09-17

Yes, backups have to be done, but not the entire the DW database, most times only the most important tables have to be done quickly. Using the current avaliable backup method, can ICC server put a table online if I provide all the pack files back on the Operating System level, and what about the Knowlege Gird data of that table?

Most data transform are done in ETL tools in the past, but now another method is ELT, which means transform can be done in the database server. In ICE, an SELECT ..... CREATE/REPLACE/APPEND INTO TARGET TABLE statement support would be very nice, and from the implementation perspactive of ICE, it is not so difficutic:)

Profile
 
Posted: 22 November 2008 04:31 AM   Ignore ]   [ # 7 ]  
Jr. Member
Avatar
RankRank
Total Posts:  97
Joined  2008-08-18

Hi Amber,

ICE is well suited for differential backups. Data are divided into 2 GB files and only the last one is modified on data load (plus metadata files, which are small). Doing a differential backup of database/table folder is sufficient for maintain data accessibility, but you’re right: to keep performance, Knowledge Grid data also should be in the backup schedule. These reside in BH_RSI_Repository folder in the main database directory.

Regards,

Signature 
Profile
 
Posted: 22 November 2008 10:22 AM   Ignore ]   [ # 8 ]  
Newbie
Rank
Total Posts:  11
Joined  2008-09-17

Back to the originally question, if the current disk can’t hold more data, how can new disks be plugged-in, manually creating links? That’s not a good idea.

Profile
 
Posted: 22 November 2008 12:04 PM   Ignore ]   [ # 9 ]  
Newbie
Avatar
Rank
Total Posts:  16
Joined  2008-10-20

If the problem is you are running out of disk capacity, you can add more disks to a logical volume using Logical Volume Manager (LVM).

The issue is more of who should manage disk space for a database, the DBA or the Sysadmin. There are pro’s and con’s to both and I can certainly see your point of view, but as one sysadmin once told me, “when your only tool is hammer, you tend to see every problem as a nail”.

Profile
 
Posted: 24 November 2008 04:02 AM   Ignore ]   [ # 10 ]  
Sr. Member
Avatar
RankRankRankRank
Total Posts:  109
Joined  2008-08-18

Hi !
If you are running out of the disk space in ICE, then you would run out of it 5-30x faster if using other products (see e.g. http://www.infobright.org/Forums/viewthread/274/). While it is possible that other products have a built-in capability to add other storage locations on the fly, isn’t it simpler and more appropriate to plan the storage requirements in advance? Especially if the requirements are not high?  Surely, after several months one may need to add storage anyway. Then, as Carl wrote, delegate the problem to LVM - it is free, proved, flexible and widely used. Surely the initial setup should be done with LVM to enable smooth expansion. Or you can shutdown, copy the datadir folder to a new, bigger disk, and restart. For 1TB database it would be around 100GB to copy - no problem.

Profile
 
Posted: 30 November 2008 10:49 AM   Ignore ]   [ # 11 ]  
Newbie
Rank
Total Posts:  3
Joined  2008-11-18

Hi
Okay I’m starting to some evaulation on infobright. Will it support the following scenario.

1 Load data into the Infobright db from cvs/t file generated from a Mysql Db export.
2 Query this data
3 Load more data into the same tables and continue querying at the same time.
4 Perform deletes on the Infobright db and again be able to perform queries at the same time.

My aim is make a 24/7 on line InfoBright reporting database populated with dumps based on time queries from a hight transaction data base(Mysql ndbcluster).
But I need full HA etc on all. Hence I need to be sure infobright can handle queries and updates such a deletes while being loaded.

By the way to load, I persume if i don’t want to do any translation on the data I just make the Infobright tables the same as the myslq tables and load the data in the dump file.
How can I do delete by date?

Any advice of suggestions welcome.
J

Profile
 
Posted: 30 November 2008 06:59 PM   Ignore ]   [ # 12 ]  
Jr. Member
Avatar
RankRank
Total Posts:  56
Joined  2008-08-18

Hi James,

A couple of things. I’m not sure if DB export provides the correct format for the ICE loader. I will need to double check on that.

As for loading, the ICE transaction model is an exclusive table lock, so queries are blocked during a load. However if the loads are relatively small (and ICE loader is very quick), many of our customers do periodic loads and query quite successfully in between the load operations. They don’t even notice that loads are occurring.

ICE doesn’t support DELETE. However you could consider “partitioning” the tables into date ranges. For instance, have one table per day say. Create a view that UNION’s the tables together so they look like one table to a end user. Then you can DROP TABLE to remove the older data.

You mentioned HA. That would take a bit more discussion depending on your requirements. If you want a “cool standby”, you can use a shared SAN’s for data storage and a proxy server. If the “master” fails, the secondary machine can be brought on line (proxy server re-routes activities) within a few seconds. There are several other solutions but would need to know more about your requirements to suggest some alternatives.

Good luck.  -V

Signature 
Profile
 
Posted: 30 November 2008 07:46 PM   Ignore ]   [ # 13 ]  
Jr. Member
Avatar
RankRank
Total Posts:  56
Joined  2008-08-18

James,

The DB export will not work, as it creates INSERT statements. You need to dump the table in a CSV format to the file for import into ICE.

I will see if I can get you the exact command format to do that.

Cheers -V

Signature 
Profile
 
Posted: 01 December 2008 12:17 PM   Ignore ]   [ # 14 ]  
Newbie
Avatar
Rank
Total Posts:  16
Joined  2008-10-20

SELECT ... INTO OUTFILE is the complement of LOAD DATA INFILE; and is intended to let you very quickly dump a table to a text file. The syntax for the export_options part of the statement consists of the same FIELDS and LINES clauses that are used with the LOAD DATA INFILE statement.

To export data, use the following MySQL command:
SELECT … INTO OUTFILE ‘file_name’
[FIELDS
[TERMINATED BY ‘string’]
[ENCLOSED BY ‘char’]
[ESCAPED ‘char’]]
FROM ‘tbl_name’;
where:
file_name = path to the file where data will be exported

To import data into a Infobright table, use the following MySQL loading command:
LOAD DATA INFILE ‘file_name’ INTO TABLE tbl_name
[FIELDS
[TERMINATED BY ‘char’]
[ENCLOSED BY ‘char’]
[ESCAPED ‘char’]
];
where:
file_name = path to the file to be loaded
tbl_name = name of the table where the data will be loaded

The following example may help;

select * from my_table into outfile ‘/tmp/my_table.txt’ fields terminated by ‘|’;
load data infile ‘tmp/my_table.txt’ into table my_table fields terminated by ‘|’;

Profile