seduling algo


Upload: jitendradausa

Post on 18-Feb-2018


7/23/2019 seduling algo

http://slidepdf.com/reader/full/seduling-algo 1/75

©Silberschatz, Korth and Sudarshan, Database System Concepts

Chapter 12: Indexing and Hashing

Basic Concepts

Ordered Indices

B+-Tree Index Files

B-Tree Index Files

Static Hashing

Dynamic Hashing

Comparison of Ordered Indexing and Hashing

Index Definition in SQL

Multiple-Key Access


Basic Concepts

Indexing mechanisms are used to speed up access to desired data. E.g., the author catalog in a library.

Search key: an attribute or set of attributes used to look up records in a file.

An index file consists of records (called index entries) of the form (search-key, pointer).

Index files are typically much smaller than the original file.

Two basic kinds of indices:

Ordered indices: search keys are stored in sorted order.

Hash indices: search keys are distributed uniformly across “buckets” using a “hash function”.


Index Evaluation Metrics

Access types supported efficiently. E.g., records with a specified value in the attribute, or records with an attribute value falling in a specified range of values.

Access time

Insertion time

Deletion time

Space overhead


Ordered Indices

In an ordered index, index entries are stored sorted on the search-key value. E.g., the author catalog in a library.

Primary index: in a sequentially ordered file, the index whose search key specifies the sequential order of the file. Also called a clustering index.

The search key of a primary index is usually, but not necessarily, the primary key.

Secondary index: an index whose search key specifies an order different from the sequential order of the file. Also called a non-clustering index.

Index-sequential file: an ordered sequential file with a primary index.

Indexing techniques are evaluated on the basis of the index evaluation metrics listed above.


Dense Index Files

Dense index: an index record appears for every search-key value in the file.


Sparse Index Files

Sparse index: contains index records for only some search-key values. Applicable when records are sequentially ordered on the search key.

To locate a record with search-key value K, we:

Find the index record with the largest search-key value ≤ K.

Search the file sequentially, starting at the record to which the index record points.

Less space and less maintenance overhead for insertions and deletions.

Generally slower than a dense index for locating records.

Good tradeoff: a sparse index with an index entry for every block in the file, corresponding to the least search-key value in the block.
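The lookup procedure for a one-entry-per-block sparse index can be sketched as follows. This is only an illustration: the `blocks` data, the `sparse_index` list and the `lookup` function are hypothetical names, the file is modelled as in-memory lists, and records with the same key are assumed not to straddle a block boundary.

```python
from bisect import bisect_right

# Hypothetical file: each block is a sorted list of (search_key, record) pairs.
blocks = [
    [("Brighton", "A-217"), ("Downtown", "A-101")],
    [("Mianus", "A-215"), ("Perryridge", "A-102")],
    [("Redwood", "A-222"), ("Round Hill", "A-305")],
]
# Sparse index: the least search key of each block, in block order.
sparse_index = [block[0][0] for block in blocks]

def lookup(key):
    """Return the records with the given search key, or [] if none exist."""
    # Position of the last index entry whose key is <= the search key.
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return []  # key is smaller than every key in the file
    found = []
    # Scan sequentially from the block the index entry points to.
    for block in blocks[pos:]:
        for k, record in block:
            if k == key:
                found.append(record)
            elif k > key:
                return found  # file is sorted, so no later match is possible
    return found
```

The binary search touches only the small index; the sequential scan then reads at most the blocks from the chosen entry onward.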


Example of Sparse Index Files


Multilevel Index

If the primary index does not fit in memory, access becomes expensive.

To reduce the number of disk accesses to index records, treat the primary index kept on disk as a sequential file and construct a sparse index on it.

Outer index: a sparse index of the primary index.

Inner index: the primary index file.

If even the outer index is too large to fit in main memory, yet another level of index can be created, and so on.

Indices at all levels must be updated on insertion or deletion from the file.


Multilevel Index (Cont.)


Index Update: Deletion

If the deleted record was the only record in the file with its particular search-key value, the search key is also deleted from the index.

Single-level index deletion:

Dense indices: deletion of the search key is similar to file record deletion.

Sparse indices: if an entry for the search key exists in the index, it is deleted by replacing the entry in the index with the next search-key value in the file (in search-key order). If the next search-key value already has an index entry, the entry is deleted instead of being replaced.


Index Update: Insertion

Single-level index insertion:

Perform a lookup using the search-key value appearing in the record to be inserted.

Dense indices: if the search-key value does not appear in the index, insert it.

Sparse indices: if the index stores an entry for each block of the file, no change needs to be made to the index unless a new block is created. In this case, the first search-key value appearing in the new block is inserted into the index.

Multilevel insertion (as well as deletion) algorithms are simple extensions of the single-level algorithms.


Secondary Indices

Frequently, one wants to find all the records whose values in a certain field (which is not the search key of the primary index) satisfy some condition.

Example 1: In the account database stored sequentially by account number, we may want to find all accounts in a particular branch.

Example 2: As above, but where we want to find all accounts with a specified balance or range of balances.

We can have a secondary index with an index record for each search-key value; the index record points to a bucket that contains pointers to all the actual records with that particular search-key value.
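A minimal sketch of such a secondary index on balance, assuming a small made-up `accounts` list in which list positions stand in for record pointers. The names `balance_index`, `accounts_with_balance` and `accounts_in_balance_range` are illustrative, not taken from the slides.

```python
from collections import defaultdict

# Hypothetical account file, stored in account-number order.
accounts = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-215", "Mianus", 700),
    ("A-217", "Brighton", 750),
    ("A-218", "Perryridge", 700),
]

# Secondary index on balance: one entry per distinct balance, each pointing
# to a bucket of record positions (standing in for record pointers).
balance_index = defaultdict(list)
for position, (_number, _branch, balance) in enumerate(accounts):
    balance_index[balance].append(position)

def accounts_with_balance(balance):
    """Follow the bucket for one balance value back to the records."""
    return [accounts[p][0] for p in balance_index.get(balance, [])]

def accounts_in_balance_range(low, high):
    """Range query: visit the index entries in sorted key order."""
    hits = []
    for balance in sorted(balance_index):
        if low <= balance <= high:
            hits.extend(accounts[p][0] for p in balance_index[balance])
    return hits
```

Note that records returned for a range come back in balance order, not file order, so each one may live in a different block of the underlying file.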


Secondary Index on balance field of account


Primary and Secondary Indices

Secondary indices have to be dense.

Indices offer substantial benefits when searching for records.

When a file is modified, every index on the file must be updated. Updating indices imposes overhead on database modification.

Sequential scan using a primary index is efficient, but a sequential scan using a secondary index is expensive: each record access may fetch a new block from disk.


B+-Tree Index Files

B+-tree indices are an alternative to indexed-sequential files.

Disadvantage of indexed-sequential files: performance degrades as the file grows, since many overflow blocks get created. Periodic reorganization of the entire file is required.

Advantage of B+-tree index files: the index automatically reorganizes itself with small, local changes in the face of insertions and deletions. Reorganization of the entire file is not required to maintain performance.

Disadvantage of B+-trees: extra insertion and deletion overhead, space overhead.

Advantages of B+-trees outweigh disadvantages, and they are used extensively.


B+-Tree Index Files (Cont.)

A B+-tree is a rooted tree satisfying the following properties:

All paths from root to leaf are of the same length.

Each node that is not a root or a leaf has between ⌈n/2⌉ and n children.

A leaf node has between ⌈(n–1)/2⌉ and n–1 values.

Special cases:

If the root is not a leaf, it has at least 2 children.

If the root is a leaf (that is, there are no other nodes in the tree), it can have between 0 and (n–1) values.
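The occupancy rules above are plain arithmetic on n, so they can be checked with a short helper. The function name `occupancy_bounds` and the dictionary keys are hypothetical labels chosen for this sketch.

```python
from math import ceil

def occupancy_bounds(n):
    """Return (min, max) counts for a B+-tree whose nodes hold n pointers."""
    return {
        "nonleaf_children": (ceil(n / 2), n),
        "leaf_values": (ceil((n - 1) / 2), n - 1),
        "root_children_if_not_leaf": (2, n),
        "root_values_if_leaf": (0, n - 1),
    }
```

For n = 5 this gives 2 to 4 values per leaf and 3 to 5 children per non-leaf node.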


B+-Tree Node Structure

Typical node: P1, K1, P2, K2, . . ., Pn–1, Kn–1, Pn

Ki are the search-key values.

Pi are pointers to children (for non-leaf nodes) or pointers to records or buckets of records (for leaf nodes).

The search keys in a node are ordered:

K1 < K2 < K3 < . . . < Kn–1


Leaf Nodes in B+-Trees

Properties of a leaf node:

For i = 1, 2, . . ., n–1, pointer Pi either points to a file record with search-key value Ki, or to a bucket of pointers to file records, each record having search-key value Ki. The bucket structure is needed only if the search key does not form a primary key.

If Li, Lj are leaf nodes and i < j, Li’s search-key values are less than Lj’s search-key values.

Pn points to the next leaf node in search-key order.


Non-Leaf Nodes in B+-Trees

Non-leaf nodes form a multi-level sparse index on the leaf nodes.

For a non-leaf node with m pointers:

All the search keys in the subtree to which P1 points are less than K1.

For 2 ≤ i ≤ m – 1, all the search keys in the subtree to which Pi points have values greater than or equal to Ki–1 and less than Ki.

All the search keys in the subtree to which Pm points are greater than or equal to Km–1.


Example of a B+-tree

B+-tree for account file (n = 3)


Example of B+-tree

B+-tree for account file (n = 5)

Leaf nodes must have between 2 and 4 values (⌈(n–1)/2⌉ and n–1, with n = 5).

Non-leaf nodes other than the root must have between 3 and 5 children (⌈n/2⌉ and n, with n = 5).

The root must have at least 2 children.


Observations about B+-trees

Since the inter-node connections are done by pointers, “logically” close blocks need not be “physically” close.

The non-leaf levels of the B+-tree form a hierarchy of sparse indices.

The B+-tree contains a relatively small number of levels (logarithmic in the size of the main file), thus searches can be conducted efficiently.

Insertions and deletions to the main file can be handled efficiently, as the index can be restructured in logarithmic time (as we shall see).


Queries on B+-Trees

Find all records with a search-key value of k.

1. Start with the root node.

(a) Examine the node for the smallest search-key value > k.

(b) If such a value exists, assume it is Ki. Then follow Pi to the child node.

(c) Otherwise k ≥ Km–1, where there are m pointers in the node. Then follow Pm to the child node.

2. If the node reached by following the pointer above is not a leaf node, repeat the above procedure on the node, and follow the corresponding pointer.

3. Eventually reach a leaf node. If for some i, key Ki = k, follow pointer Pi to the desired record or bucket. Else no record with search-key value k exists.
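The search steps above can be sketched in a few lines. The `Node` class, the `find` function and the hand-built three-leaf tree are assumptions made for illustration; a real index would read nodes from disk blocks rather than follow Python references.

```python
from bisect import bisect_right
from dataclasses import dataclass

@dataclass
class Node:
    keys: list       # K1 < K2 < ... in sorted order
    pointers: list   # children (non-leaf) or records/buckets (leaf)
    is_leaf: bool = False

def find(root, k):
    """Return the record or bucket for search key k, or None if absent."""
    node = root
    while not node.is_leaf:
        # Index of the smallest key > k; if no such key exists this equals
        # len(keys), which selects the last pointer Pm.
        i = bisect_right(node.keys, k)
        node = node.pointers[i]
    for key, pointer in zip(node.keys, node.pointers):
        if key == k:
            return pointer
    return None

# Tiny hand-built tree (illustrative data only).
leaf1 = Node(["Brighton", "Downtown"], ["rec-B", "rec-D"], is_leaf=True)
leaf2 = Node(["Mianus", "Perryridge"], ["rec-M", "rec-P"], is_leaf=True)
leaf3 = Node(["Redwood", "Round Hill"], ["rec-Rw", "rec-RH"], is_leaf=True)
root = Node(["Mianus", "Redwood"], [leaf1, leaf2, leaf3])
```

Using `bisect_right` means a key equal to Ki is routed to the subtree on its right, which matches the rule that that subtree holds values greater than or equal to Ki.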


Queries on B+-Trees (Cont.)

In processing a query, a path is traversed in the tree from the root to some leaf node.

If there are K search-key values in the file, the path is no longer than ⌈log_⌈n/2⌉(K)⌉.

A node is generally the same size as a disk block, typically 4 kilobytes, and n is typically around 100 (40 bytes per index entry).

With 1 million search-key values and n = 100, at most ⌈log_50(1,000,000)⌉ = 4 nodes are accessed in a lookup.

Contrast this with a balanced binary tree with 1 million search-key values: around 20 nodes are accessed in a lookup.

The above difference is significant since every node access may need a disk I/O, costing around 20 milliseconds!
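The figures quoted above follow directly from the bound, as this short calculation shows. `max_path_length` is a hypothetical helper name; the 20 ms per I/O figure is the one given in the slide.

```python
from math import ceil, log, log2

def max_path_length(num_keys, n):
    """Upper bound on nodes visited: ceil(log base ceil(n/2) of num_keys)."""
    return ceil(log(num_keys, ceil(n / 2)))

bplus_nodes = max_path_length(1_000_000, 100)   # B+-tree with n = 100
binary_nodes = ceil(log2(1_000_000))            # balanced binary tree

# Worst-case lookup cost if every node access needs one 20 ms disk I/O.
bplus_ms = bplus_nodes * 20
binary_ms = binary_nodes * 20
```

Under these assumptions a lookup costs roughly 80 ms through the B+-tree against roughly 400 ms through the binary tree.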


Updates on B+-Trees: Insertion

Find the leaf node in which the search-key value would appear.

If the search-key value is already there in the leaf node, the record is added to the file and, if necessary, a pointer is inserted into the bucket.

If the search-key value is not there, then add the record to the main file and create a bucket if necessary. Then:

If there is room in the leaf node, insert the (key-value, pointer) pair in the leaf node.

Otherwise, split the node (along with the new (key-value, pointer) entry) as discussed in the next slide.


Updates on B+-Trees: Insertion (Cont.)

Splitting a node:

Take the n (search-key value, pointer) pairs (including the one being inserted) in sorted order. Place the first ⌈n/2⌉ in the original node, and the rest in a new node.

Let the new node be p, and let k be the least key value in p. Insert (k, p) in the parent of the node being split. If the parent is full, split it and propagate the split further up.

The splitting of nodes proceeds upwards till a node that is not full is found. In the worst case the root node may be split, increasing the height of the tree by 1.

Result of splitting node containing Brighton and Downtown on inserting Clearview
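The leaf split can be sketched as below, using the Brighton/Downtown/Clearview case named in the slide with n = 3. The function `split_leaf` and the `ptr-*` strings are illustrative placeholders; inserting the separator into the parent is left out.

```python
from math import ceil

def split_leaf(pairs, new_pair):
    """Split an overfull leaf; return (left_pairs, right_pairs, separator_key)."""
    all_pairs = sorted(pairs + [new_pair])     # the n pairs, in key order
    cut = ceil(len(all_pairs) / 2)             # the first ceil(n/2) stay put
    left, right = all_pairs[:cut], all_pairs[cut:]
    separator = right[0][0]                    # least key in the new node
    return left, right, separator

left, right, k = split_leaf(
    [("Brighton", "ptr-B"), ("Downtown", "ptr-D")],
    ("Clearview", "ptr-C"),
)
```

Here Brighton and Clearview stay in the original node, Downtown moves to the new node, and (Downtown, new node) is what would be inserted into the parent.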


Updates on B+-Trees: Insertion (Cont.)

B+-tree before and after insertion of “Clearview”


Updates on B+-Trees: Deletion

Find the record to be deleted, and remove it from the main file and from the bucket (if present).

Remove the (search-key value, pointer) pair from the leaf node if there is no bucket or if the bucket has become empty.

If the node has too few entries due to the removal, and the entries in the node and a sibling fit into a single node, then:

Insert all the search-key values in the two nodes into a single node (the one on the left), and delete the other node.

Delete the pair (Ki–1, Pi), where Pi is the pointer to the deleted node, from its parent, recursively using the above procedure.


Updates on B+-Trees: Deletion

Otherwise, if the node has too few entries due to the removal, and the entries in the node and a sibling do not fit into a single node, then:

Redistribute the pointers between the node and a sibling such that both have at least the minimum number of entries.

Update the corresponding search-key value in the parent of the node.

The node deletions may cascade upwards till a node which has ⌈n/2⌉ or more pointers is found. If the root node has only one pointer after deletion, it is deleted and the sole child becomes the root.
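The choice between merging and redistributing can be sketched for two adjacent leaves as follows. This is a simplified illustration: `rebalance_leaves` is a hypothetical name, leaves are reduced to sorted key lists, and the matching update of the parent node is not shown.

```python
from math import ceil

def rebalance_leaves(left, right, n):
    """Merge or redistribute two sibling leaves after a deletion.

    left, right: sorted key lists of adjacent leaves (left keys < right keys).
    n: maximum number of pointers per node, so a leaf holds at most n - 1 keys.
    Returns (action, new_left, new_right).
    """
    capacity = n - 1
    minimum = ceil((n - 1) / 2)
    if len(left) >= minimum and len(right) >= minimum:
        return "none", left, right
    combined = left + right
    if len(combined) <= capacity:
        # Everything fits in one node: keep the left node, delete the right.
        return "merge", combined, []
    # Otherwise share the entries so that both nodes reach the minimum.
    cut = ceil(len(combined) / 2)
    return "redistribute", combined[:cut], combined[cut:]
```

After a redistribution the parent's separating key would be replaced by the first key of the new right-hand leaf; after a merge the parent loses one (key, pointer) pair and may itself become underfull.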


Examples of B+-Tree Deletion

Before and after deleting “Downtown”

The removal of the leaf node containing “Downtown” did not result in its parent having too few pointers. So the cascaded deletions stopped with the deleted leaf node’s parent.


Examples of B+-Tree Deletion (Cont.)

Deletion of “Perryridge” from result of previous example

The node with “Perryridge” becomes underfull (actually empty, in this special case) and is merged with its sibling.

As a result, the “Perryridge” node’s parent became underfull, and was merged with its sibling (and an entry was deleted from their parent).

The root node then had only one child, and was deleted; its child became the new root node.


Example of B+-tree Deletion (Cont.)

Before and after deletion of “Perryridge” from earlier example

The parent of the leaf containing Perryridge became underfull, and borrowed a pointer from its left sibling.

The search-key value in the parent’s parent changes as a result.


B+-Tree File Organization

The index file degradation problem is solved by using B+-tree indices. The data file degradation problem is solved by using B+-tree file organization.

The leaf nodes in a B+-tree file organization store records, instead of pointers.

Since records are larger than pointers, the maximum number of records that can be stored in a leaf node is less than the number of pointers in a nonleaf node.

Leaf nodes are still required to be half full.

Insertion and deletion are handled in the same way as insertion and deletion of entries in a B+-tree index.


B+-Tree File Organization (Cont.)

Example of B+-tree file organization

Good space utilization is important since records use more space than pointers.

To improve space utilization, involve more sibling nodes in redistribution during splits and merges.

Involving 2 siblings in redistribution (to avoid split / merge where possible) results in each node having at least ⌊2n/3⌋ entries.


B-Tree Index Files

Similar to B+-tree, but B-tree allows search-key values to appear only once; eliminates redundant storage of search keys.

Search keys in nonleaf nodes appear nowhere else in the B-tree; an additional pointer field for each search key in a nonleaf node must be included.

Generalized B-tree leaf node

Nonleaf node: pointers Bi are the bucket or file record pointers.


B-Tree Index File Example

B-tree (above) and B+-tree (below) on same data


B-Tree Index Files (Cont.)

Advantages of B-tree indices:

May use fewer tree nodes than a corresponding B+-tree.

Sometimes possible to find a search-key value before reaching a leaf node.

Disadvantages of B-tree indices:

Only a small fraction of all search-key values are found early.

Non-leaf nodes are larger, so fan-out is reduced. Thus, B-trees typically have greater depth than the corresponding B+-tree.

Insertion and deletion are more complicated than in B+-trees.

Implementation is harder than for B+-trees.

Typically, advantages of B-trees do not outweigh disadvantages.


Static Hashing

A bucket is a unit of storage containing one or more records (a bucket is typically a disk block).

In a hash file organization we obtain the bucket of a record directly from its search-key value using a hash function.

Hash function h is a function from the set of all search-key values K to the set of all bucket addresses B.

The hash function is used to locate records for access, insertion as well as deletion.

Records with different search-key values may be mapped to the same bucket; thus the entire bucket has to be searched sequentially to locate a record.


Example of Hash File Organization (Cont.)

Hash file organization of account file, using branch-name as key (see figure in next slide).

There are 10 buckets.

The binary representation of the ith character is assumed to be the integer i.

The hash function returns the sum of the binary representations of the characters modulo 10.

E.g. h(Perryridge) = 5, h(Round Hill) = 3, h(Brighton) = 3
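The example hash function is easy to reproduce. The sketch below assumes that "ith character" means the letter's position in the alphabet (a = 1, ..., z = 26), that case is ignored, and that non-letters such as the space in "Round Hill" contribute nothing; the slide does not spell these details out.

```python
NUM_BUCKETS = 10

def h(branch_name):
    """Sum of letter positions (a=1, ..., z=26) modulo the bucket count."""
    # Assumption: case is ignored and non-letters (e.g. spaces) add nothing.
    total = sum(
        ord(c) - ord("a") + 1
        for c in branch_name.lower()
        if "a" <= c <= "z"
    )
    return total % NUM_BUCKETS
```

Under those assumptions the function reproduces the three values quoted in the slide, including the collision of Round Hill and Brighton in bucket 3.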


Example of Hash File Organization

Hash file organization of account file, using branch-name as key (see previous slide for details).


Hash Functions

The worst hash function maps all search-key values to the same bucket; this makes access time proportional to the number of search-key values in the file.

An ideal hash function is uniform, i.e., each bucket is assigned the same number of search-key values from the set of all possible values.

An ideal hash function is random, so each bucket will have the same number of records assigned to it irrespective of the actual distribution of search-key values in the file.

Typical hash functions perform computation on the internal binary representation of the search key.

For example, for a string search key, the binary representations of all the characters in the string could be added and the sum modulo the number of buckets could be returned.


Handling of Bucket Overflows

Bucket overflow can occur because of:

Insufficient buckets.

Skew in distribution of records. This can occur due to two reasons: multiple records have the same search-key value, or the chosen hash function produces a non-uniform distribution of key values.

Although the probability of bucket overflow can be reduced, it cannot be eliminated; it is handled by using overflow buckets.


Handling of Bucket Overflows (Cont.)

Overflow chaining: the overflow buckets of a given bucket are chained together in a linked list.

The above scheme is called closed hashing. An alternative, called open hashing, which does not use overflow buckets, is not suitable for database applications.


Hash Indices

Hashing can be used not only for file organization, but also for index-structure creation.

A hash index organizes the search keys, with their associated record pointers, into a hash file structure.

Strictly speaking, hash indices are always secondary indices: if the file itself is organized using hashing, a separate primary hash index on it using the same search key is unnecessary.

However, we use the term hash index to refer to both secondary index structures and hash organized files.


Example of Hash Index


Deficiencies of Static Hashing

In static hashing, function h maps search-key values to a fixed set B of bucket addresses.

Databases grow with time. If the initial number of buckets is too small, performance will degrade due to too many overflows.

If the file size at some point in the future is anticipated and the number of buckets is allocated accordingly, a significant amount of space will be wasted initially.

If the database shrinks, again space will be wasted.

One option is periodic re-organization of the file with a new hash function, but it is very expensive.

These problems can be avoided by using techniques that allow the number of buckets to be modified dynamically.


Dynamic Hashing

Good for a database that grows and shrinks in size.

Allows the hash function to be modified dynamically.

Extendable hashing: one form of dynamic hashing.

The hash function generates values over a large range, typically b-bit integers, with b = 32.

At any time use only a prefix of the hash function to index into a table of bucket addresses.

Let the length of the prefix be i bits, 0 ≤ i ≤ 32.

Bucket address table size = 2^i. Initially i = 0.

The value of i grows and shrinks as the size of the database grows and shrinks.

Multiple entries in the bucket address table may point to a bucket. Thus, the actual number of buckets is ≤ 2^i.

The number of buckets also changes dynamically due to coalescing and splitting of buckets.


General Extendable Hash Structure

In this structure, i2 = i3 = i, whereas i1 = i – 1 (see next slide for details).


Use of Extendable Hash Structure

Each bucket j stores a value i_j; all the entries that point to the same bucket have the same values on the first i_j bits.

To locate the bucket containing search-key K_j:

1. Compute h(K_j) = X.

2. Use the first i high-order bits of X as a displacement into the bucket address table, and follow the pointer to the appropriate bucket.

To insert a record with search-key value K_j:

Follow the same procedure as look-up and locate the bucket, say j.

If there is room in the bucket j, insert the record in the bucket.

Else the bucket must be split and insertion re-attempted (next slide).

Overflow buckets are used instead in some cases (will see shortly).
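The look-up step amounts to a bit shift. In the sketch below, `hash32`, `table_slot` and `locate` are hypothetical names, and CRC-32 from Python's standard `zlib` module is used only as a convenient stand-in for a 32-bit hash function; the slides do not prescribe one.

```python
import zlib

HASH_BITS = 32  # b in the slides

def hash32(key):
    """Stand-in 32-bit hash (CRC-32); the slides do not fix a hash function."""
    return zlib.crc32(key.encode("utf-8"))

def table_slot(hash_value, i):
    """Index of the bucket-address-table entry: the first i high-order bits."""
    return hash_value >> (HASH_BITS - i) if i > 0 else 0

def locate(bucket_address_table, i, key):
    """Follow the table entry selected by the i-bit prefix of h(key)."""
    return bucket_address_table[table_slot(hash32(key), i)]
```

With i = 0 every key maps to the single table entry; each increment of i doubles the number of distinguishable prefixes.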


Updates in Extendable Hash Structure

To split a bucket j when inserting a record with search-key value K_j:

If i > i_j (more than one pointer to bucket j):

Allocate a new bucket z, and set i_j and i_z to the old i_j + 1.

Make the second half of the bucket address table entries pointing to j point to z.

Remove and reinsert each record in bucket j.

Recompute the new bucket for K_j and insert the record in the bucket (further splitting is required if the bucket is still full).

If i = i_j (only one pointer to bucket j):

Increment i and double the size of the bucket address table.

Replace each entry in the table by two entries that point to the same bucket.

Recompute the new bucket address table entry for K_j. Now i > i_j, so use the first case above.
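The two split cases can be put together in a compact in-memory sketch. The class names `Bucket` and `ExtendableHash` are assumptions for illustration; callers pass hash values directly so the behaviour is easy to trace, and when the prefix is exhausted the bucket is simply allowed to grow, standing in for an overflow bucket.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth   # i_j in the slides
        self.items = []                  # (hash_value, record) pairs

class ExtendableHash:
    """Extendable hashing over explicit b-bit hash values (a sketch)."""

    def __init__(self, bucket_size=2, hash_bits=32):
        self.bucket_size = bucket_size
        self.hash_bits = hash_bits       # b
        self.i = 0                       # global prefix length
        self.table = [Bucket(0)]         # bucket address table, 2**i entries

    def _slot(self, h):
        # First i high-order bits of the hash value.
        return h >> (self.hash_bits - self.i) if self.i else 0

    def insert(self, h, record):
        bucket = self.table[self._slot(h)]
        while len(bucket.items) >= self.bucket_size:
            if bucket.local_depth == self.hash_bits:
                break  # prefix exhausted: let the bucket overflow instead
            self._split(bucket)
            bucket = self.table[self._slot(h)]
        bucket.items.append((h, record))

    def _split(self, bucket):
        if bucket.local_depth == self.i:
            # Only one pointer to the bucket: double the table first, so
            # each old entry becomes two adjacent entries.
            self.table = [b for b in self.table for _ in range(2)]
            self.i += 1
        # Now i > i_j: the second half of the entries for j point to z.
        bucket.local_depth += 1
        z = Bucket(bucket.local_depth)
        slots = [s for s, b in enumerate(self.table) if b is bucket]
        for s in slots[len(slots) // 2:]:
            self.table[s] = z
        # Remove and reinsert every record that was in bucket j.
        old_items, bucket.items = bucket.items, []
        for h, record in old_items:
            self.table[self._slot(h)].items.append((h, record))
```

For example, with `hash_bits=4` and `bucket_size=2`, inserting the hash values 0010, 1101, 1010 and 1111 in that order forces two table doublings on the "1" side while the bucket for prefix "0" keeps a local depth of 1 and is shared by two table entries.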


Updates in Extendable Hash Structure (Cont.)

When inserting a value, if the bucket is full after several splits (that is, i reaches some limit b), create an overflow bucket instead of splitting the bucket entry table further.

To delete a key value:

Locate it in its bucket and remove it.

The bucket itself can be removed if it becomes empty (with appropriate updates to the bucket address table).

Coalescing of buckets can be done (can coalesce only with a “buddy” bucket having the same value of i_j and the same i_j – 1 prefix, if it is present).

Decreasing the bucket address table size is also possible.

Note: decreasing the bucket address table size is an expensive operation and should be done only if the number of buckets becomes much smaller than the size of the table.

Use of Extendable Hash Structure: Example

Initial Hash structure, bucket size = 2

Example (Cont.)

Hash structure after insertion of one Brighton and two Downtown records

Example (Cont.)

Hash structure after insertion of Mianus record

Example (Cont.)

Hash structure after insertion of three Perryridge records

Example (Cont.)

Hash structure after insertion of Redwood and Round Hill records

Extendable Hashing vs. Other Schemes

Benefits of extendable hashing:

  Hash performance does not degrade with growth of file

  Minimal space overhead

Disadvantages of extendable hashing:

  Extra level of indirection to find desired record

  Bucket address table may itself become very big (larger than memory)

    Need a tree structure to locate desired record in the structure!

  Changing size of bucket address table is an expensive operation

Linear hashing is an alternative mechanism which avoids these disadvantages at the possible cost of more bucket overflows

Comparison of Ordered Indexing and Hashing

Cost of periodic re-organization

Relative frequency of insertions and deletions

Is it desirable to optimize average access time at the expense of worst-case access time?

Expected type of queries:

  Hashing is generally better at retrieving records having a specified value of the key.

  If range queries are common, ordered indices are to be preferred

Index Definition in SQL

Create an index

create index <index-name> on <relation-name>(<attribute-list>)

E.g.: create index b-index on branch(branch-name)

Use create unique index to indirectly specify and enforce the condition that the search key is a candidate key.

Not really required if SQL unique integrity constraint is supported

To drop an index

drop index <index-name>
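As a hedged illustration of these statements, the sketch below runs them against an in-memory SQLite database through Python's sqlite3 module. The choice of SQLite is an assumption, as are the column list of branch and the sample rows. Underscores replace the hyphens of the slide's identifiers, since hyphenated names would need quoting in SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE branch (branch_name TEXT, branch_city TEXT, assets INTEGER)"
)

# A unique index enforces that branch_name is a candidate key.
conn.execute("CREATE UNIQUE INDEX b_index ON branch(branch_name)")
conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Horseneck', 1700000)")

try:
    # A second row with the same branch_name violates the unique index.
    conn.execute("INSERT INTO branch VALUES ('Perryridge', 'Rye', 400000)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Dropping the index also drops the uniqueness check it enforced.
conn.execute("DROP INDEX b_index")
```

After the index is dropped, the same duplicate insert would be accepted, which shows that the constraint lived in the index and not in the table definition.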

Multiple-Key Access

Use multiple indices for certain types of queries.

Example:

select account-number
from account
where branch-name = “Perryridge” and balance = 1000

Possible strategies for processing query using indices on single attributes:

1. Use index on branch-name to find accounts at the Perryridge branch; test balance = 1000.

2. Use index on balance to find accounts with balances of $1000; test branch-name = “Perryridge”.

3. Use branch-name index to find pointers to all records pertaining to the Perryridge branch. Similarly use index on balance. Take intersection of both sets of pointers obtained.
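Strategy 3 can be sketched with Python sets standing in for the pointer lists returned by each single-attribute index. The account rows, and the use of list positions as record pointers, are assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical account records: (account_number, branch_name, balance)
accounts = [
    ("A-101", "Downtown", 500),
    ("A-102", "Perryridge", 400),
    ("A-201", "Perryridge", 1000),
    ("A-215", "Mianus", 1000),
    ("A-218", "Perryridge", 1000),
]


def build_index(records, column):
    """Map each value of the given column to the set of record pointers."""
    index = defaultdict(set)
    for rid, rec in enumerate(records):
        index[rec[column]].add(rid)
    return index


branch_index = build_index(accounts, 1)
balance_index = build_index(accounts, 2)

# Intersect the two pointer sets before touching any record.
matching = branch_index["Perryridge"] & balance_index[1000]
result = sorted(accounts[rid][0] for rid in matching)
```

Here each index returns three pointers, but only the two records in the intersection are fetched.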

Indices on Multiple Attributes

Suppose we have an index on combined search-key (branch-name, balance).

With the where clause
where branch-name = “Perryridge” and balance = 1000
the index on the combined search-key will fetch only records that satisfy both conditions.

Using separate indices is less efficient: we may fetch many records (or pointers) that satisfy only one of the conditions.

Can also efficiently handle
where branch-name = “Perryridge” and balance < 1000

But cannot efficiently handle
where branch-name < “Perryridge” and balance = 1000
May fetch many records that satisfy the first but not the second condition.
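The difference between the three where clauses can be seen by modelling the combined index as a sorted list of (branch-name, balance) keys and searching it with Python's bisect module. This is a sketch under assumptions: a sorted list stands in for the ordered index, and the entries are made up.

```python
import bisect

# Hypothetical index entries: ((branch_name, balance), account_number)
index = sorted([
    (("Brighton", 750), "A-217"),
    (("Downtown", 500), "A-101"),
    (("Mianus", 1000), "A-215"),
    (("Perryridge", 400), "A-102"),
    (("Perryridge", 900), "A-201"),
    (("Perryridge", 1000), "A-218"),
    (("Redwood", 700), "A-222"),
])
keys = [k for k, _ in index]

# branch-name = "Perryridge" and balance = 1000: one contiguous run of entries.
lo = bisect.bisect_left(keys, ("Perryridge", 1000))
hi = bisect.bisect_right(keys, ("Perryridge", 1000))
eq = [acct for _, acct in index[lo:hi]]

# branch-name = "Perryridge" and balance < 1000: also one contiguous run.
# The 1-tuple ("Perryridge",) sorts before every ("Perryridge", x).
lo = bisect.bisect_left(keys, ("Perryridge",))
hi = bisect.bisect_left(keys, ("Perryridge", 1000))
lt = [acct for _, acct in index[lo:hi]]

# branch-name < "Perryridge" and balance = 1000: matching entries are not
# contiguous, so every entry with a smaller branch name is scanned and filtered.
scanned = index[:bisect.bisect_left(keys, ("Perryridge",))]
matched = [acct for (_, balance), acct in scanned if balance == 1000]
```

In the last case three entries are scanned to find one match, which is the inefficiency the slide describes.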

Grid Files

Structure used to speed the processing of general multiple search-key queries involving one or more comparison operators.

The grid file has a single grid array and one linear scale for each search-key attribute. The grid array has number of dimensions equal to number of search-key attributes.

Multiple cells of grid array can point to same bucket

To find the bucket for a search-key value, locate the row and column of its cell using the linear scales and follow pointer
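A point look-up in a two-attribute grid file can be sketched as follows. The linear scales, the grid contents and the convention that a value equal to a boundary falls into the higher cell are all assumptions made for illustration.

```python
import bisect

# Hypothetical linear scales: each lists the boundaries between adjacent cells.
branch_scale = ["Central", "Mianus", "Perryridge", "Townsend"]    # 5 rows
balance_scale = [1000, 2000, 5000, 10000, 50000, 100000]          # 7 columns

n_rows = len(branch_scale) + 1
n_cols = len(balance_scale) + 1

# Grid array: each cell holds a bucket id. In this made-up layout,
# neighbouring columns share a bucket, so several cells point to one bucket.
grid = [[f"bucket-{r}-{c // 2}" for c in range(n_cols)] for r in range(n_rows)]


def cell_for(branch_name, balance):
    """Locate the row and column of the cell using the linear scales."""
    row = bisect.bisect_right(branch_scale, branch_name)
    col = bisect.bisect_right(balance_scale, balance)
    return row, col


def find_bucket(branch_name, balance):
    """Follow the pointer stored in the cell."""
    row, col = cell_for(branch_name, balance)
    return grid[row][col]
```

With these scales, ("Perryridge", 500) and ("Perryridge", 1300) fall into different cells of row 3 but resolve to the same bucket.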

Example Grid File for account

Queries on a Grid File

A grid file on two attributes A and B can handle queries of all following forms with reasonable efficiency

(a1 ≤ A ≤ a2)

(b1 ≤ B ≤ b2)

(a1 ≤ A ≤ a2 ∧ b1 ≤ B ≤ b2)

E.g., to answer (a1 ≤ A ≤ a2 ∧ b1 ≤ B ≤ b2), use linear scales to find corresponding candidate grid array cells, and look up all the buckets pointed to from those cells.
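Finding the candidate cells for such a range query can be sketched as below. The scales are hypothetical, and a value equal to a boundary is assumed to fall into the higher cell. A query on one attribute alone is expressed by leaving the other range unbounded.

```python
import bisect

a_scale = [10, 20, 30]   # hypothetical boundaries for attribute A: 4 rows
b_scale = [100, 200]     # hypothetical boundaries for attribute B: 3 columns


def candidate_cells(a1, a2, b1, b2):
    """Cells that may hold records with a1 <= A <= a2 and b1 <= B <= b2."""
    r_lo = bisect.bisect_right(a_scale, a1)
    r_hi = bisect.bisect_right(a_scale, a2)
    c_lo = bisect.bisect_right(b_scale, b1)
    c_hi = bisect.bisect_right(b_scale, b2)
    return [(r, c)
            for r in range(r_lo, r_hi + 1)
            for c in range(c_lo, c_hi + 1)]


both = candidate_cells(12, 25, 0, 150)
only_a = candidate_cells(12, 25, float("-inf"), float("inf"))
```

The buckets reached from these cells can still contain records outside the requested range, so each retrieved record has to be checked against the query conditions.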

Grid Files (Cont.)

During insertion, if a bucket becomes full, new bucket can be created if more than one cell points to it.

  Idea similar to extendable hashing, but on multiple dimensions

  If only one cell points to it, either an overflow bucket must be created or the grid size must be increased

Linear scales must be chosen to uniformly distribute records across cells.

  Otherwise there will be too many overflow buckets.

Periodic re-organization to increase grid size will help.

  But reorganization can be very expensive.

Space overhead of grid array can be high.

R-trees (Chapter 23) are an alternative

Bitmap Indices

Bitmap indices are a special type of index designed for efficient querying on multiple keys

Records in a relation are assumed to be numbered sequentially from, say, 0

  Given a number n it must be easy to retrieve record n

    Particularly easy if records are of fixed size

Applicable on attributes that take on a relatively small number of distinct values

  E.g. gender, country, state, …

  E.g. income-level (income broken up into a small number of levels such as 0-9999, 10000-19999, 20000-50000, 50000-infinity)

A bitmap is simply an array of bits

Bitmap Indices (Cont.)

In its simplest form a bitmap index on an attribute has a bitmap for each value of the attribute

  Bitmap has as many bits as records

  In a bitmap for value v, the bit for a record is 1 if the record has the value v for the attribute, and is 0 otherwise
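Building such an index can be sketched in a few lines of Python. The sample relation is an assumption, and bitmaps are written as strings of 0s and 1s (record 0 leftmost) purely for readability.

```python
# Hypothetical relation: (name, gender, income_level), records numbered from 0
records = [
    ("John", "m", "L1"),
    ("Diana", "f", "L2"),
    ("Mary", "f", "L1"),
    ("Peter", "m", "L4"),
    ("Kathy", "f", "L3"),
]


def build_bitmap_index(rows, column):
    """Return one bitmap per distinct value of the given column."""
    values = {row[column] for row in rows}
    return {
        v: "".join("1" if row[column] == v else "0" for row in rows)
        for v in values
    }


gender = build_bitmap_index(records, 1)
income = build_bitmap_index(records, 2)
```

Every bitmap has one bit per record, and each record position is set in exactly one bitmap of each index.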

Bitmap Indices (Cont.)

Bitmap indices are useful for queries on multiple attributes

  not particularly useful for single attribute queries

Queries are answered using bitmap operations

  Intersection (and)

  Union (or)

  Complementation (not)

Each operation takes bitmaps of the same size (two for and/or, one for not) and applies the operation on corresponding bits to get the result bitmap

  E.g. 100110 AND 110011 = 100010

       100110 OR 110011 = 110111

       NOT 100110 = 011001

  Males with income level L1: 10010 AND 10100 = 10000

    Can then retrieve required tuples.

    Counting number of matching tuples is even faster
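The operations above map directly onto integer bit operations. The sketch below uses Python integers as bitmaps, with helper names (bits, show) that are hypothetical; complementation is masked to the bitmap width because Python integers are unbounded.

```python
WIDTH = 6
MASK = (1 << WIDTH) - 1


def bits(s):
    """Parse a bitmap written as a string of 0s and 1s."""
    return int(s, 2)


def show(x, width=WIDTH):
    """Format a bitmap back into a fixed-width string."""
    return format(x, f"0{width}b")


a, b = bits("100110"), bits("110011")
and_result = show(a & b)        # intersection
or_result = show(a | b)         # union
not_result = show(~a & MASK)    # complementation, limited to WIDTH bits

# Males with income level L1, using the 5-bit bitmaps from the slide.
males_l1 = bits("10010") & bits("10100")
record_numbers = [n for n, bit in enumerate(show(males_l1, 5)) if bit == "1"]
match_count = bin(males_l1).count("1")
```

Counting the matches needs only the result bitmap, whereas retrieving the tuples requires fetching each record whose bit is set.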

Bitmap Indices (Cont.)

Bitmap indices generally very small compared with relation size

  E.g. if record is 100 bytes, space for a single bitmap is 1/800 of space used by relation.

  If number of distinct attribute values is 8, bitmap is only 1% of relation size

Deletion needs to be handled properly

  Existence bitmap to note if there is a valid record at a record location

  Needed for complementation

    not(A=v): (NOT bitmap-A-v) AND ExistenceBitmap

Should keep bitmaps for all values, even null value

  To correctly handle SQL null semantics for NOT(A=v): intersect above result with (NOT bitmap-A-Null)
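The effect of the existence bitmap and the null bitmap on complementation can be traced with a small example. The three bitmaps below are made up: record 2 is assumed to be deleted and record 4 is assumed to have a null value for A.

```python
N = 6                   # number of record slots
MASK = (1 << N) - 1


def bits(s):
    return int(s, 2)


def show(x):
    return format(x, f"0{N}b")


# Hypothetical bitmaps; the leftmost bit is record 0.
bitmap_a_v = bits("100100")      # records 0 and 3 have A = v
bitmap_a_null = bits("000010")   # record 4 has A null
existence = bits("110111")       # slot 2 holds no valid record

naive = ~bitmap_a_v & MASK                 # also selects slots 2 and 4
valid = naive & existence                  # drops the deleted slot
sql_not = valid & (~bitmap_a_null & MASK)  # drops the null record as well
```

The plain complement selects four slots, the existence bitmap reduces that to three, and removing the null record leaves the two records for which NOT(A=v) is true under SQL semantics.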

Efficient Implementation of Bitmap Operations

Bitmaps are packed into words; a single word and (a basic CPU instruction) computes and of 32 or 64 bits at once

  E.g. 1-million-bit maps can be anded with just 31,250 instructions (using 32-bit words)

Counting number of 1s can be done fast by a trick:

  Use each byte to index into a precomputed array of 256 elements each storing the count of 1s in the binary representation

    Can use pairs of bytes to speed up further at a higher memory cost

  Add up the retrieved counts

Bitmaps can be used instead of Tuple-ID lists at leaf levels of B+-trees, for values that have a large number of matching records

  Worthwhile if > 1/64 of the records have that value, assuming a tuple-id is 64 bits

  Above technique merges benefits of bitmap and B+-tree indices
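The byte-table counting trick and the word-at-a-time and can be sketched as follows. Bitmaps are held as Python bytes objects, which is an assumption, and a Python integer stands in for the sequence of machine words, so the sketch shows the logic and not the instruction count.

```python
# Precomputed table: ONES[b] is the number of 1 bits in the byte value b.
ONES = [bin(b).count("1") for b in range(256)]


def count_ones(bitmap: bytes) -> int:
    """Index the table with each byte and add up the retrieved counts."""
    return sum(ONES[byte] for byte in bitmap)


def and_bitmaps(x: bytes, y: bytes) -> bytes:
    """Bitwise and of two equal-length bitmaps."""
    n = len(x)
    value = int.from_bytes(x, "big") & int.from_bytes(y, "big")
    return value.to_bytes(n, "big")


# Word operations needed to and two 1-million-bit maps with 32-bit words.
words_needed = 1_000_000 // 32
```

The 1/64 threshold follows from the same kind of arithmetic: a bitmap costs one bit per record in the relation, while a tuple-id list costs 64 bits per matching record, so the bitmap is smaller once more than 1 in 64 records match.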

End of Chapter

Partitioned Hashing

Hash values are split into segments that depend on each attribute of the search-key.

(A1, A2, . . . , An) for n attribute search-key

Example: n = 2, for customer, search-key being (customer-street, customer-city)

search-key value       hash value
(Main, Harrison)       101 111
(Main, Brooklyn)       101 001
(Park, Palo Alto)      010 010
(Spring, Brooklyn)     001 001
(Alma, Palo Alto)      110 010

To answer equality query on single attribute, need to look up multiple buckets. Similar in effect to grid files.
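The example can be traced in Python by treating each attribute's hash segment as a 3-bit string, taken from the search-key and hash-value listing above. The dictionaries stand in for the per-attribute hash functions, and the function names are hypothetical.

```python
# 3-bit hash segments per attribute value, as listed in the example.
street_hash = {"Main": "101", "Park": "010", "Spring": "001", "Alma": "110"}
city_hash = {"Harrison": "111", "Brooklyn": "001", "Palo Alto": "010"}


def bucket_of(street, city):
    """Concatenate the two segments to get the bucket address."""
    return street_hash[street] + city_hash[city]


def buckets_for_city(city):
    """Bucket addresses an equality query on customer-city alone must inspect."""
    suffix = city_hash[city]
    return [format(prefix, "03b") + suffix for prefix in range(8)]
```

A query that fixes both attributes reads one bucket, while a query on customer-city alone must try all 8 possible street segments.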

Sequential File For account Records

Deletion of “Perryridge” From the B+-Tree of Figure 12.12

Sample account File
