estruturas de dados ii lisch eisch

8/20/2019 Estruturas de dados II LISCH EISCH

1/38

9/24/2014 1

Gazi University

Computer Engineering Department

BM307 –

File Organization


2/38

9/24/2014 2

Index

Sequential File Organization

Binary Search

Interpolation Search

Self-Organizing Sequential Search

Direct File Organization Locating Information

Hashing Functions Collision Resolution

Coalesced Hashing


3/38

9/24/2014 3

File Organization

Goal Organizing files efficiently in terms of both

space and performance

File Organization File Access

sequential sequential

indexed sequential sequential & direct

direct direct (random)


4/38

9/24/2014 4

File Access Types

Sequential – accessing multiple records (often an entirefile) and usually according to a predefined order

Direct (random) – locating a single record

Question How can we have an effective organization?

Answer matching the type of organization with the

type of intended access


5/38

9/24/2014 5


Background

Fields (eg.: Employee name, number) Records contain data about individual entities

Files (eg.: employee list)

Primary Key field(s) which uniquely distinguishes a record

from all others

Secondary Key all the remaining fields


6/38

9/24/2014 6


File consists of records of the same format

Fixed-length records

Variable-length records

Sequential File Organization (i+1)st element of a file

is stored immediately after the ith element.


7/38

9/24/2014 7


Sequential access moving from one record in the

file to the next by incrementing the address of the

current record by the record size

Direct access processing a single record directly if

we know subscript


8/38

9/24/2014 8


Probe access to a distinct location

Sequential Search

In an entire file of N records

N/2 probes are needed in average Need to probe entire file for an unseccessful retrieval

Computational complexity O(N)

Appropriate when N is small

Performance improvement? Sorting


9/38

9/24/2014 9

Eg. - Sequential Search

100000 records, each record size is 400 bytes,block size is 2400 bytes.

Sequential search time for retrieving 10000 records?

Each probe one block of data (100000*400)/2400 = 16667 blocks

Reading time for one block 0.84ms (IBM 3380)

Time requirement for each record

(16667/2)*0.84 = 7 sec. For 10000 records 7sec * 10000 = 19 hours

Better organization is needed!!


10/38

9/24/2014 10


Binary Search

Requires sorting

Compares the key of the sought record with the middle record of the file

Half of the file is eliminated in each turn

Computational complexity O(log2n)

Eg. the key of the sought record 17


11/38

9/24/2014 11

Sequential File OrganizationBinary Search (Algoritma)


12/38

9/24/2014 12

Sequential File OrganizationInterpolation Search

Approximate relative position

Eg.: Searching a name in a telephone book

Choses the next position for a comparison based upon the

estimated position of the sought keyrelative to the remainder

of the file to be searched

NEXT := LOWER + ––––––––––––––––––––––––––––––––––––––––– (UPPER-LOWER)

Worst case computational complexity O(n) Average case computational complexity O(log2 log2n)

Its performance improves as the distribution of keys becomes

more uniform

key[sought] – key [LOWER]

key[UPPER] – key [LOWER]


13/38

binary search should be preferred when the data is stored in

primary memory

Why?

interpolation search should be preferred when the data is stored in

auxilary memory

Why?

9/24/2014 13


14/38

binary search should be preferred when the data is stored in

primary memory

The additional calculations needed for the interpolation search cancel

any savings gained from fewer probes

interpolation search should be preferred when the data is stored in

auxilary memory

An access of auxiliary storage is an order of magnitude greater than

the time required for the additional calculations

9/24/2014 14


15/38

9/24/2014 15

Sequential File OrganizationSelf-Organizing Sequential Search

Modifies the order of records

Moves the most frequently retrieved records to the beginning

of the file

Most popular algorithms:

Move_to_front

Transpose

Count


16/38

9/24/2014 16


Move_to_front

The sought record is moved to the front position of the file

Potential of making big mistakes if a record accessed , moved to the

front of the file, and then rarely if ever accessed again!

A linked implementation is preferable even though it takes more storage

Appropriate when space is not limited and locality of access is important

Essentially the same as the LRU (least recently used) paging algorithm

used by operating systems


17/38

9/24/2014 17

Eg. - Move_to_front

The records are accessed in the order of “fileediting”

a b c d e f g h i j k l m n o p r q s t v w y z

f a b c d e g h i j k l m n o p r q s t v w y z

i f a b c d e g h j k l m n o p r q s t v w y z l i f a b c d e g h j k m n o p r q s t v w y z

e l i f a b c d g h j k m n o p r q s t v w y z

e l i f a b c d g h j k m n o p r q s t v w y z

d e l i f a b c g h j k m n o p r q s t v w y z

i d e l f a b c g h j k m n o p r q s t v w y z t i d e l f a b c g h j k m n o p r q s v w y z

i t d e l f a b c g h j k m n o p r q s v w y z

n i t d e l f a b c g h j k m o p r q s v w y z

g n i t d e l f a b c h j k m o p r q s v w y z


18/38

9/24/2014 18


Transpose

Interchanges the sought record with its immediate predecessor

More stable than the Move_to_front algorithm

A record needs to be accessed many times before it is moved to the

front of the list

Easily implemented

Does not need additional space

Should be used when space is premium


19/38

9/24/2014 19

Eg. - Transpose

The records are accessed in the order of “fileediting”

a b c d e f g h i j k l m n o p r q s t v w y z

a b c d f e g h i j k l m n o p r q s t v w y z

a b c d f e g i h j k l m n o p r q s t v w y z

a b c d e f g i h j k l m n o p r q s t v w y z

a b c e d f g i h j k l m n o p r q s t v w y z

a b c d e f g i h j k l m n o p r q s t v w y z

a b c d e f i g h j k l m n o p r q s t v w y z

a b c d e f i g h j k l m n o p r q t s v w y z

a b c d e i f g h j k l m n o p r q t s v w y z a b c d e i f g h j k l n m o p r q t s v w y z

a b c d e i g f h j k l n m o p r q t s v w y z


20/38

9/24/2014 20


Count

Keeps count of the number of accesses of each record

The file is always ordered in a decreasing order of frequency of

access

Requires extra sorage to keep the count

Use it only when the counts are needed for another purpose


21/38

9/24/2014 21

Direct File Organization

Ideally, we want to go directly to the address where the record is stored

A key can be unique address one probe

More address space than needed

Eg.1 billion addresses for 300 million people

Key

space

Address

space

1 – 1

999-99-9999 999-99-9999

0 0

correspondence


22/38

9/24/2014 22


Converting information into a unique address

Eg. : Airline reservation system

Flight numbers from 1 to 999

Days are numbered from 1 to 366

Flight number and day of the year could be concatenated to determine the

location

Location = flight number || day of the year, address range 001001-999366

(???367 - ???999 would not exist)

Location = day of the year || flight number , address range 001001-366999


23/38

9/24/2014 23


The key converts to a probable address

If we remove most of the empty spaces in the address space, we have

lost the 1-1 correspondence btw keys & addresses

Hashing functions are used to map the wider range of key values into

the narrower range of address values

Hash (key) probable address

Initial probable address home address

Hashing function should

Evenly distribute the keys among the addresses

Executes efficiently


24/38

9/24/2014 24


A collision occurs when two distinct keys map to the same

address

Hashing is then composed of two aspects;

The function

The collision resolution method

Key

space

Address

space

999-99-9999

1200

0

0


25/38


Hashing Functions

9/24/2014 25


26/38

9/24/2014 26

Direct File OrganizationHashing Functions

Squaring

Taking square of a key and then substringing or truncating a portion

of the result

Radix conversion

The key is considered to be in a base other than 10 ans is then

converted into a number in base 10

Eg.: Base 11 1234 = 1 * 113 + 2 * 112 + 3 * 111 + 4 * 110= 1331 + 242 + 33 + 4

= 1610

substringing or truncation could then be used


27/38

9/24/2014 27

Direct File OrganizationHashing Functions

Polynomial hashing

The key is divided by a polynomial

f(information area) cyclic check bytes

Alphabetic keys

Alphabetic or alphanumeric key values can be input to a hashing

function if the values are interpreted as integers


28/38

9/24/2014 28

Direct File OrganizationCollisions

For a given set of data, one hashing function may distribute the keysmore evenly over the address space than another

A hashing function that has a large number of collisions is said to

exhibit primary clustering

It is better to have a slightly more expensive hashing function for datathat need to be stored on auxiliary storage

Another method for reducing collisions is reducing the packing factor

number of records stored

total number of storage locationsPacking factor =

storage

c o l l i s i o n s


29/38

9/24/2014 29

Direct File OrganizationCollision Resolution

Collision resolution with links

Collision resolution without links Static positioning of records

Dynamic positioning of records

Collision resolution with pseudolinks


30/38

9/24/2014 30


Collision resolution with links

If multiple synonyms occur for aparticular home address, we forma chain of synonym records

Disadvantage extra storage isneeded

Collision resolution without links

We can use implied links byapplying a convention , or set ofrules for deciding where to gonext

A simple convention is to look atthe next location in memory

Advantage NO extra storage isneeded


31/38

9/24/2014 31

Direct File OrganizationCoalesced Hashing

Occurs when we attempt to insert a record with a homeaddress that is already occupied by a record from a chain with

a different home address

The two chains with records having different home addresses

coalesce or grow together

X ,D, Y were inserted


32/38

9/24/2014 32


(Eg.) Hash (key) = key mod 11

27, 18, 29, 28, 39, 13, 16

42 & 17 added

Average # of probes 1.8


33/38

9/24/2014 33


Discussion Packing factor of the final table = 9/11 (82%)

One method of reducing coalescing is to reduce the packing factor

It would be advisable to place the most frequently accessed records early inthe insertion process

Deleting a record is complicated

If coalescing has occurred,

a simple deletion procedure is

to move a record later in the

probe chain into the position of

the deleted record

Final table after deleting 39 ---------->


34/38

9/24/2014 34


Variants Table organization (whether or not a seperate

overflow area is used)

The manner of linking a colliding item into a chain

The manner of choosing unoccupied locations

Table Organization

Table primary area + overflow area

Adres factor = (primary area ) / (total table size)

Best performance when the adres factor is 0.86


35/38

9/24/2014 35


Variants Late Insertion Standart Colesced Hashing (LISCH)

New records are inserted at the end ofa probe chain

Lack of a cellar

Late Insertion Coalesced Hashing (LICH)

Uses a cellar

Eg. Keys: 27, 18, 29, 28, 39, 13, 16, 42, 17

hashing function: key mod 7

Average # of probes 1.3

(It was 1.8 for LISCH) In general, for a 90 percent packing factor,

using a cellar will reduce the number of

probes by about 6 percent compared

with LISCH


36/38

9/24/2014 36


Variants

Early Insertion Standart Colesced Hashing (EISCH) İnserts a new record into a position on the probe chain immediately after the record

srored at its home address

İnsertion of the record with key 17 according to EISCH algorithm: Hash (key) = key mod 11


37/38

9/24/2014 37


Variants

Random Early Insertion Standart Colesced Hashing (REISCH) Choosing a random unoccupied location for the new insertion

Gives only a 1% improvement over EISCH

Random Late Insertion Standart Colesced Hashing (RLISCH)

Bidirectional Late Insertion Standart Colesced Hashing (BLISCH) Choosing the overflow location for a collision insertion by alternating the

selection between the top and bottom of the table

Bidirectional Early Insertion Standart Colesced Hashing (BEISCH)


38/38

9/24/2014 38


Comparison

estruturas de dados ii lisch eisch

Documents