VLDB, Very Large DataBases


VLDB, Very Large DataBases:  Welcome to this discussion of VLDB, covering its implementation, application, design, and architecture needs.  The focus here is on the techniques necessary to set up, maintain, and grow a database of 1 to 80 terabytes.  The target audience is those with a background in systems architecture, database administration, or systems administration.  The advantages and disadvantages of VLDB will also be considered from a business perspective.

This discussion aims to build an understanding of:
• Impacts of VLDB on Hardware and Software
• Sizing and scoping of hardware
• Parallel Processing, multi-threaded Architecture in relation to VLDB

The other areas of discussion will focus on: networks, partitions, clustering files and indexes, database sharing, CPU speed and quantity, RAM speed and quantity, and an introduction to MPP possibilities (such as NUMA-Q).  A new series will introduce new structures capable of assisting in housing VLDB data sets: a structure called a Data Collection, and a new data architecture model called the Data Vault™ will be presented as a means for massive data storage in a detailed normalized form.  Watch for this series in the coming months.
Many of the theories and hypotheses presented here are supported by mathematical functions; however, mathematical theory is not the focus of this series, so the mathematical functions are only referenced in certain places.  The proofs of these theorems and functions will not be demonstrated.

What is VLDB? VLDW?
Different definitions exist for each term, and the terms themselves conjure different images for everyone.  However, the standard industry definitions are as follows:

VLDB = Very Large DataBase, an environment or storage space managed by a relational database management system (RDBMS) consisting of vast quantities of information.  The definition of what exactly constitutes a VLDB changes every day as hardware and software adapt, become faster, and handle ever larger loads.  Usually, though, VLDBs “don’t fit on laptops”.

VLDW = Very Large Data Warehouse.  A data warehouse can be made up of both physical (hardware/software) and virtual (electronic information) content.  The warehouse is the structure or architecture serving one or more specific business purposes.  The VLDW is based on VLDB technology; the two usually go hand in hand.

Technical BIG BANG Theory
The database or data warehouse will grow in the specific areas consisting of transactional information.  The information describing the transaction itself usually changes less frequently and can be termed metadata – data about data.  Most transactions around the world consist of: date and time (when did the transaction occur?), who (the individual conducting, initiating, or entering the transaction), and what (the content of the transaction itself).

Beyond these three components, transactions typically include many descriptive elements and relationships between pieces of information – all of which must be recognized, but should be stored in a minimalist fashion in the database.  Therefore compression ratios, growth ratios, and metrics ranging from implementation time to overall sizing should be computable attributes following a reasonably accurate profiling activity.  The key questions are: how many types of transactions, and how many total transactions, are being entered into the system?

The theory goes on to say that hardware, software, speed, RDBMS, and data set growth appear to be closely related, and consistently so.  Therefore, common mathematical techniques should hold true for splitting data into chunks while holding the other elements constant.  As a result, speed should change only if one of the other variables changes.

The suggested very broad formula for this theory might be: speed = [(hardware)*(software)] bytes per second / (size of data in bytes)

With this formula, if the size of the data increases while the hardware and software stay constant, the speed decreases.  Vice versa, if the hardware improves and the size of the data stays the same, the speed should increase.  Of course, everything depends on the bottleneck of the system.  A delicate balance is achieved by keeping these three “points” growing in equal portions.  Think of three sides of a triangle:

[Figure: a triangle whose three sides are hardware, software, and data size, with speed as an arrow inside]

The pressure is in the middle of the triangle, pushing outward.  Speed is the size of the arrow set inside the triangle.  Increasing one side doesn’t necessarily increase speed; a number of factors must be addressed.  If you picture a pyramid instead, the third dimension might be the number of parallel processes capable of executing the requests.  Consider the volume of the pyramid as the resulting speed: the more volume, potentially the more speed for more data.

Hardware = Size + Speed

  1. Size = amount of RAM, # of parallel CPUs, # of controllers, # of disks
  2. Speed = RAM speed, disk speed, bus speed, network speed (bandwidth), minus the # of idle cycles per second per CPU and the # of blocked I/O processes per second per CPU.

Software = # of parallel processes + # of partitions + operating speed of software (throughput)

If the formula holds true, then there are a number of things that can be done to achieve better performance.
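
To make the formula concrete, here is a minimal sketch in Python of a back-of-the-envelope calculator.  The composite hardware_score and the software score are assumptions invented for illustration – they are not measured benchmarks, and the weighting (simple multiplication) is just the simplest choice consistent with the formula above.

    # Back-of-the-envelope estimate of: speed = (hardware * software) / data size.
    # The "scores" here are illustrative composites, not measured values.

    def hardware_score(ram_gb: float, cpus: int, controllers: int, disks: int) -> float:
        """Hypothetical composite of the Size components listed above."""
        return ram_gb * cpus * controllers * disks

    def estimated_speed(hardware: float, software: float, data_bytes: float) -> float:
        """Relative speed: grows with hardware/software, shrinks as data grows."""
        return (hardware * software) / data_bytes

    hw = hardware_score(ram_gb=64, cpus=8, controllers=2, disks=24)
    print(estimated_speed(hw, software=1.0, data_bytes=1e12))   # 1 TB
    print(estimated_speed(hw, software=1.0, data_bytes=8e13))   # 80 TB

Holding hardware and software constant while the data grows from 1 TB to 80 TB drops the estimated speed by exactly 80x, which is the behavior the triangle is meant to illustrate.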

Is all of this important?

We are seeing rapid change in both technology and the requirements for analyzing massive sets of information.  More and more businesses are looking at the related issues and recognizing that patterns, trends, and analysis of this information are becoming critical to the success of the business.  Ok – enough business talk.  What about the technical end of it?  Because of the business demand, the IT specialists left to implement these complex systems are beginning to deal with physical problems they’ve never had to think about before.  OLTP systems simply “deleted” or rolled off the unused portion of the data, and data warehouses were physically unable (until recently) to hold such massive sets of information.  The focus has changed: the name of the game for IT is to get the most accurate answer from a compilation of as much information as possible – and to do it as quickly as possible, providing that much-desired competitive edge in business.

In recognizing speed as a foremost goal, we begin to realize that one of IT’s responsibilities is integrating, understanding, and tuning all of these different components – not to mention the mountains of data that exist within an enterprise.  Technically, having a VLDB situation without the expertise to predict it, manage it, grow it, or architect it is like having the information aboard the best aircraft carrier in the world without having the aircraft.  What’s the point?  More simply put, it’s like a rowboat without the oars.

More and more IT personnel need to understand the varying levels in order to tune the correct component.  This is analogous to improving a car with the best carburetor money can buy, but then putting the same old fuel into the engine.  Maybe the engine is too small?  Maybe the fuel is bad?  Maybe there isn’t enough horsepower to take advantage of the best carburetor?  By understanding only a single component of this triangle, only limited levels of performance and management of VLDB can be achieved.  Given the amount of money it costs a business to house, build, and deploy VLDB, it would be a grave mistake on the business’s part not to properly train IT employees.

So where does one go to acquire such knowledge?  It isn’t commonly taught, and it’s not readily available on the web.  This series will try to help unravel some of the complications and necessities of what to think about when getting into VLDB for the first time.  For those advanced in VLDB, please send any comments, omissions, errors, or thoughts to the author directly.  We always welcome input.  This series is meant to be dynamic in nature.

Who’s involved in VLDB?

Now that the discussion has opened: who is actually involved in VLDB?  Who needs to know?  What’s the weakest link?  Ok, sorry for the bad pun – we’re just checking to see if you’re awake.  The answer is: everyone.  Everyone who interacts from a business perspective has a stake in VLDB, even if they don’t know it.  They need to understand that the questions they are beginning to ask about the content (specifically the historical patterns in the content) require more and more analysis.  In turn, most statisticians will tell you that to predict six months ahead (most accurately), you should have at least twelve months of history behind it.  In other words: twice as much data for half the amount of prediction over time.  These statistics hold when drawing the most accurate conclusions in VLDB.  Sure, sample sets can be used, but what if a key factor is missed?  Examples of this are written up in books such as the Oracle8i DBA Handbook and some other sparsely populated writings.

From an IT perspective, everybody who touches, maintains, architects/designs, or builds a data warehouse is interested (or should be) in VLDB.  The list of roles includes: systems architect, systems administrator, enterprise architect, database administrator, ETL operator, DSS/OLAP (business intelligence) individual, and operations for backup/restore and disaster recovery.

Next, we will cover VLDB concepts such as parallel processing, architectures, and introductions to hardware/software requirements.

VLDB / VLDW

This installment covers the concepts behind VLDB and VLDW: what to think about when creating these environments or considering them for implementation.  There are many concepts regarding VLDB which, due to time and space constraints, will not be covered here.  If there is a desire to see a write-up on certain VLDB concepts left out of this series, please feel free to contact Core Integration Partners and let us know.

The concepts included in this discussion focus on the hardware, software, and architecture sides of VLDB.  There are separate chapters on hardware and software in this series.  These concepts are interwoven with all of those ideas and should be read as prerequisite knowledge before launching into the other respective areas.

Architecture Overview

Architecture plays a much larger role in VLDB than most care to think about.  In order to understand VLDB/VLDW, we must consider the architecture and the concepts behind what makes it all work.  The architecture is like the set of blueprints for a house: without an overview of what to put where, and the guidelines or steps to build it, it becomes difficult to build correctly.  All too often projects begin with the right intentions but forget to include the architectural components.  This leaves the foundations of VLDB extremely weak, and many problems begin to crop up as data set size grows rapidly.

What is a thread?

A thread is typically defined as an independent unit of execution that can run asynchronously with other threads or processes.  Usually a thread is capable of operating independently of other threads or processes.  When a thread has to wait on the results of another thread, it sits in “idle mode” until a token is passed (a semaphore or mutex).  Ok, in English: a thread can be likened to a car.  Put four cars at a four-way stop sign; each car runs independently of the others.  They all arrive at the stop at different times.  Then each car proceeds in turn and continues on its way.  Think of each car as a thread and the stop sign as a semaphore or shared lock: all cars must stop before proceeding, and each car may proceed only when the intersection is clear.

Threads can also be thought of in terms of different users connecting to the database.  Each has their own connection, and each can execute different queries at the same time, or the same query at the same time.  Threads allow multiple processes to occur simultaneously.  This concept is very important for understanding how multi-threaded architectures work.  A multi-threaded architecture gives the system the ability to run multiple threads or processes at the same time.  The system manages the four-way stop (with stop lights) so that accidents and traffic jams are avoided as much as possible.
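
To make the four-way-stop analogy concrete, here is a minimal sketch in Python.  The car names and timings are invented for the example; the lock plays the role of the stop sign (the semaphore).

    import threading
    import time

    # The intersection is a shared lock - the stop sign of the analogy.
    intersection = threading.Lock()

    def car(name: str) -> None:
        """Each car (thread) runs independently but must hold the lock to cross."""
        print(f"{name} arrives at the stop")
        with intersection:              # blocks (sits idle) until the intersection is clear
            print(f"{name} crosses the intersection")
            time.sleep(0.1)             # time spent crossing

    cars = [threading.Thread(target=car, args=(f"car-{i}",)) for i in range(4)]
    for t in cars:
        t.start()                       # all four cars are driving at once
    for t in cars:
        t.join()                        # wait until every car is through

All four threads run at the same time, but only one at a time holds the intersection; the others wait in exactly the “idle mode” described above.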


Every car is a thread.  All cars must stop because of the traffic light, but each car moves independently of the others.

Why is this important?  It is important to recognize threads as a basis of systems architecture for VLDB, so that larger tasks or requests (such as massive inserts, selects, deletes, or updates) can be split into multiple smaller tasks that run at the same time – thus shortening the overall time needed to complete the task.  Going back to the earlier analogy, it’s similar to having N forklifts.  Each thread is a forklift; the operators (people) are like the system that communicates and manages what each thread is doing.  Lots of little threads working together can complete what is otherwise a huge task in a short period of time.  However, without the operating system providing parallel processing capabilities, threads still execute in order (serially).  It would be like having five forklifts but only one key that fits them all: you can only operate one forklift at a time.

The timeline for serial execution is multiples longer than that of parallel execution: parallel threads all execute together at the same time, while serial threads execute one after the other.

What is Parallel Processing?

Parallel processing is the ability of the system to run, manage, and maintain multiple threads at the same time (simultaneously).  It’s the five keys and five operators needed to run each forklift independently.  Degree of parallelism is a measure of how many parallel processes can run at once before the system is overloaded or maxed out.  For instance, if you have five stacks of lumber and seven forklifts, you can still only operate five forklifts at a time, giving a degree of parallelism of five (see the sketch below).  This is a very simplistic explanation of parallel processing theory.  There are many books on the market that describe in great detail how parallel processing works and how degree of parallelism is reached.  It is not in the scope of this series to discuss each in extreme detail, only to introduce the concepts as they relate to VLDB.
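
A small sketch of degree of parallelism in Python (the task counts and sleep time are invented): seven forklift trips, but a pool capped at five workers, so only five run at once.

    from concurrent.futures import ThreadPoolExecutor
    import time

    def move_stack(stack: int) -> int:
        """One forklift trip; pretend each stack takes half a second to move."""
        time.sleep(0.5)
        return stack

    start = time.perf_counter()
    # max_workers=5 is the degree of parallelism: seven tasks finish in two
    # waves (~1.0s) instead of seven serial trips (~3.5s).
    with ThreadPoolExecutor(max_workers=5) as pool:
        moved = list(pool.map(move_stack, range(7)))
    print(f"moved {len(moved)} stacks in {time.perf_counter() - start:.2f}s")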

Parallel processing enables speed and the division of tasks across CPUs.  This is also where the concept of load balancing comes into play.  Load balancing is the process by which the operating system decides where to run the threads, how many threads to run, and which ones need to sit idle.  It attempts to maximize the overall usage of the available hardware resources.  Load balancing in the lumberyard would consist of deciding which forklifts are out of gas and need to sit idle.  Maybe one of the forklifts can carry a larger load than the others, so it may be used to carry a slightly larger pile in a shorter time frame, or two smaller piles more quickly to the other side of the yard.  Load balancing is dynamic in nature.  Threads are swapped to and from different CPUs unless they are bound to a CPU during execution.  Being CPU-bound in this sense means the threads have been tied to run on a specific CPU, and only that CPU (often called CPU affinity), and as far as the load balancer is concerned cannot be moved to run on another CPU.  Very few programs allow this level of control by the designer or operator.  This is a double-edged sword, and can be tremendously beneficial or extremely dangerous depending on how it’s set up.
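
As an illustration of binding a process to a CPU, Linux exposes exactly this control (a minimal sketch; os.sched_setaffinity is Linux-specific, and the CPU numbers are arbitrary):

    import os

    # Linux-only: pin the current process (pid 0 = "self") to CPU 0,
    # taking it away from the load balancer - the double-edged sword above.
    print("eligible CPUs before:", os.sched_getaffinity(0))
    os.sched_setaffinity(0, {0})                              # run on CPU 0 only
    print("eligible CPUs after: ", os.sched_getaffinity(0))
    os.sched_setaffinity(0, set(range(os.cpu_count() or 1)))  # hand control back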

Four-Way Intersection

In this case, the stoplights represent the operating system and its capacity for executing or managing parallel processes.  The threads (cars) are the processes.

Why is parallel processing important to VLDB?  It allows multiple processes (SQL queries, for instance) to execute in the exact same time frame and with the same priority.  It also provides the vehicle for very large tasks to be split into multiple smaller tasks for faster execution.  Without parallel processing, it becomes difficult if not impossible to handle systems of scale.  Parallel processing also plays a huge role in the success of partitioning the data – in the example used here, the stacks of lumber being cut into shorter stacks.  The concept of partitioning will be explained in a follow-on chapter.  Now that threads exist, and parallel processes exist to manage them, how does the system decide who takes priority and when?  The answer is a technique called pre-emptive multi-tasking.

What is pre-emptive multi-tasking?
Pre-emptive multi-tasking is the ability to handle parallel processes while granting priorities to certain processes, so that they may interrupt other processes at specific points in time or in reaction to specific events.  In the case of the four-way intersection with stoplights, a pre-emptive task would be an ambulance: it changes all the lights to red but is still (in most cases) able to go through the intersection.  It’s a random, unpredictable occurrence, but when it happens, it takes priority over the cars on the road.  The multi-tasking part is the parallel processing: it allows multiple tasks to interrupt other tasks at the same time – for instance, three ambulances at three different intersections all have the same effect and the same priority on the road.  Hopefully all the ambulances don’t head toward the same intersection at the same time; this would cause a collision, and the ambulances would have to slow down, stop, and wait until each one makes it through the intersection.  This situation is called blocked I/O or blocked processing, and it is discussed later in this chapter.

Why is pre-emptive multi-tasking important?  What does it mean to the VLDB world, and how does it play into the operating system?  It’s important because it allows the operating system to manage execution priorities.  For instance, if you type a key on your keyboard, the system must respond immediately, regardless of what it’s doing.  This is an example of a hardware interrupt, one of the highest-priority events that can occur on a system.  It allows the system to react to a series of unpredictable events at the time they happen.  The system considers it less important to update the screen than to respond to a keyboard event.  These priorities also run through different software in the operating system.  In fact, in near-real-time systems or true pre-emptive multi-tasking systems such as Unix, these priorities can even be assigned.  The Unix term for this is “niceness”: setting the “nice” value for software sets its scheduling priority.

For instance, you can tell the system what’s an ambulance versus a fire truck versus a car.  This matters because being able to set priority in VLDB means the database software can take precedence over other software running on the same system.  Once priority is decided, the software makes a request to the operating system, which enters a queue.  When the system has time, or needs to execute the request, it pulls it from the queue, assigns it to a processor, and begins executing.  Of course there are many more steps to this process than described here, but that’s the general flow.
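
As a small Unix-only illustration of “niceness” (the increment of 10 is arbitrary), a process can voluntarily lower its own scheduling priority so the database gets first call on the CPU:

    import os

    # os.nice(increment) adds to the process's nice value and returns the result;
    # unprivileged users can only raise niceness (i.e., lower their own priority).
    current = os.nice(0)     # an increment of 0 just reads the current value
    lowered = os.nice(10)    # ask the scheduler to favor other work over us
    print(f"niceness: {current} -> {lowered}")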

Unfortunately, there are still systems out there that claim to be multi-threaded.  That may be true, but when it comes down to brass tacks, they are not truly pre-emptive multi-tasking.  These systems block processes from executing an order of magnitude more frequently than systems with true pre-emptive multi-tasking.  It’s one of the primary differences between Unix and Microsoft’s Windows platforms.

What is a blocked process?
A blocked process is a thread sitting in idle mode, waiting for another of its requests to complete.  It can be likened to going to a gas station and finding out they are refilling the tanks below the surface, so you have to wait maybe five minutes (until they turn the pumps back on) before you can fill up your car.  Before you can get gas, events such as turning the pumps back on must happen first.  This sequencing of events causes you to wait, which is equivalent to the “idle state” of the executing thread.

A thread sitting idle, waiting for something else to happen or complete, is called a blocked process or blocked thread.  Blocked I/O simply means the thread is waiting for a disk operation to complete – the most common event that blocks threads.  There are many different ways for threads or processes to become blocked; to find out more, consult books on multi-threaded processing, which walk through all the details.

Are blocked processes bad?  Not necessarily, unless there are too many processes waiting in the wings to execute.  Sometimes multiple threads wait on each other; this can be equated to a deadlock situation in a database.  Deadlocks cause machines to freeze up, frequently requiring cold boots.  Blocked processes are a problem for resource monitors, because CPU utilization rates drop while the CPU suspends the thread in an “idle” or wait state until the blocking lock releases and execution continues.  The problem is that actual CPU utilization may drop while 100% of the available resources remain claimed, causing CPU load to be significantly higher than the utilization rate.

The less blocking the processes or threads do, the faster the overall processing.  The problem is, no matter what’s done, there will always be interrupting threads with higher priorities, some of which require one or more lower-priority processes to become blocked.  Why is this important to VLDB?  When dealing with massive sets of information, it is optimal to break processing up into smaller, more manageable processes that can be executed independently.  When the dependencies between the separate threads have been removed, they can execute in parallel with less “blocked” action – as long as each thread is going after an independent data set on disk, which again leads to partitioning the large data set in the database.  If you have four threads going after the same data, you’ll end up with dueling threads.  It would be like having four forklifts trying to lift the same stack of lumber at the same time.  It makes no sense.
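
A minimal sketch of that partitioning idea in Python (the partitioning scheme and row counts are invented): each worker owns a disjoint slice of the data, so no two workers contend for the same rows.

    from concurrent.futures import ProcessPoolExecutor

    ROWS = list(range(1_000_000))        # stand-in for a large table

    def process_partition(partition: list) -> int:
        """Work on one disjoint slice - no shared rows, no dueling threads."""
        return sum(partition)

    if __name__ == "__main__":
        n = 4                            # four forklifts, four separate stacks
        chunk = len(ROWS) // n
        partitions = [ROWS[i * chunk:(i + 1) * chunk] for i in range(n)]
        with ProcessPoolExecutor(max_workers=n) as pool:
            totals = list(pool.map(process_partition, partitions))
        print("per-partition totals:", totals)
        print("grand total:", sum(totals))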

What are pipes and why do they matter?
“Pipes” is a conceptual term used for throughput – at least in this document.  The pipes represent the system’s ability to move X amount of information in Y time frame through a series of circuits or connections.  Ok, what was that again?  In the lumberyard, it might be the number of stacks, or the total amount of lumber, that a delivery truck can carry on a single tank of gas.  It’s the transport mechanism for the information inside the system.  Without proper sizing of the transport mechanism, the job takes too long, costs too much, or produces too much waste or excess.  In technical terms, that means over-utilization of available resources, under-utilization of available resources, or too tight a time frame in which to move too much information.

Just like everything else, pipes come in all shapes, sizes, and colors.  A virtual pipe can have any attribute you wish to assign to it.  In any case, the throughput of these pipes is what matters most: how fast can a predetermined amount of information travel from point A to point B?  The pipe may be a network connection, a disk connection, a disk controller, or the bus on the CPU board.  It could even represent the speed of the CPU itself – yes, it’s all about speed and performance.
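
Sizing a pipe is then simple arithmetic.  As a quick worked example (all rates below are assumptions for illustration, not vendor specifications):

    # How long does it take to push a data set through a pipe of a given throughput?
    def transfer_hours(data_tb: float, pipe_mb_per_s: float) -> float:
        """Hours to move data_tb terabytes through a pipe of pipe_mb_per_s MB/s."""
        seconds = (data_tb * 1_000_000) / pipe_mb_per_s   # 1 TB = 1,000,000 MB
        return seconds / 3600

    for pipe in (100, 1_000, 10_000):   # MB/s - e.g., GbE, 10GbE, a fast SAN
        print(f"80 TB through {pipe:>6} MB/s: {transfer_hours(80, pipe):8.1f} hours")

At 100 MB/s, an 80-terabyte data set takes over 222 hours to move; even at 10,000 MB/s it still takes more than two hours.  This is why pipe sizing matters so much at VLDB scale.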
