estruturas de dados ii lisch eisch

Upload: leonardo-pitta

Post on 07-Aug-2018

215 views

Category:

Documents


0 download

TRANSCRIPT

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    1/38

    9/24/2014 1

    Gazi University

    Computer Engineering Department

    BM307 –

    File Organization

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    2/38

    9/24/2014 2

    Index

    Sequential File Organization

    Binary Search

    Interpolation Search

    Self-Organizing Sequential Search

    Direct File Organization Locating Information

    Hashing Functions Collision Resolution

    Coalesced Hashing

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    3/38

    9/24/2014 3

    File Organization

    Goal Organizing files efficiently in terms of both

    space and performance

    File Organization File Access

    sequential sequential

    indexed sequential sequential & direct

    direct direct (random)

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    4/38

    9/24/2014 4

    File Access Types

    Sequential – accessing multiple records (often an entirefile) and usually according to a predefined order 

    Direct (random) – locating a single record

    Question How can we have an effective organization?

     Answer  matching the type of organization with the

    type of intended access

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    5/38

    9/24/2014 5

    Sequential File Organization

    Background

    Fields (eg.: Employee name, number) Records contain data about individual entities

    Files (eg.: employee list)

    Primary Key field(s) which uniquely distinguishes a record

    from all others

    Secondary Key all the remaining fields

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    6/38

    9/24/2014 6

    Sequential File Organization

    File consists of records of the same format

    Fixed-length records

    Variable-length records

    Sequential File Organization (i+1)st element of a file

    is stored immediately after the ith element.

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    7/38

    9/24/2014 7

    Sequential File Organization

    Sequential access moving from one record in the

    file to the next by incrementing the address of the

    current record by the record size

    Direct access processing a single record directly if

    we know subscript

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    8/38

    9/24/2014 8

    Sequential File Organization

    Probe access to a distinct location

    Sequential Search

    In an entire file of N records

    N/2 probes are needed in average Need to probe entire file for an unseccessful retrieval

    Computational complexity O(N)

     Appropriate when N is small

    Performance improvement? Sorting

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    9/38

    9/24/2014 9

    Eg. - Sequential Search

    100000 records, each record size is 400 bytes,block size is 2400 bytes.

    Sequential search time for retrieving 10000 records?

    Each probe one block of data (100000*400)/2400 = 16667 blocks

    Reading time for one block 0.84ms (IBM 3380)

    Time requirement for each record

    (16667/2)*0.84 = 7 sec. For 10000 records 7sec * 10000 = 19 hours

    Better organization is needed!!

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    10/38

    9/24/2014 10

    Sequential File Organization

    Binary Search

    Requires sorting

    Compares the key of the sought record with the middle record of the file

    Half of the file is eliminated in each turn

    Computational complexity O(log2n)

    Eg. the key of the sought record 17

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    11/38

    9/24/2014 11

    Sequential File OrganizationBinary Search (Algoritma)

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    12/38

    9/24/2014 12

    Sequential File OrganizationInterpolation Search

     Approximate relative position

    Eg.: Searching a name in a telephone book

    Choses the next position for a comparison based upon the

    estimated position of the sought keyrelative to the remainder

    of the file to be searched

    NEXT := LOWER +  ––––––––––––––––––––––––––––––––––––––––– (UPPER-LOWER)

    Worst case computational complexity O(n)  Average case computational complexity O(log2 log2n)

    Its performance improves as the distribution of keys becomes

    more uniform

    key[sought] – key [LOWER]

    key[UPPER] – key [LOWER]

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    13/38

    binary search should be preferred when the data is stored in

    primary memory

    Why?

    interpolation search should be preferred when the data is stored in

    auxilary memory

    Why?

    9/24/2014 13

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    14/38

    binary search should be preferred when the data is stored in

    primary memory

    The additional calculations needed for the interpolation search cancel

    any savings gained from fewer probes

    interpolation search should be preferred when the data is stored in

    auxilary memory

     An access of auxiliary storage is an order of magnitude greater than

    the time required for the additional calculations

    9/24/2014 14

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    15/38

    9/24/2014 15

    Sequential File OrganizationSelf-Organizing Sequential Search

    Modifies the order of records

    Moves the most frequently retrieved records to the beginning

    of the file

    Most popular algorithms:

    Move_to_front

    Transpose

    Count

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    16/38

    9/24/2014 16

    Sequential File Organization

    Move_to_front

    The sought record is moved to the front position of the file

    Potential of making big mistakes if a record accessed , moved to the

    front of the file, and then rarely if ever accessed again!

     A linked implementation is preferable even though it takes more storage

     Appropriate when space is not limited and locality of access is important

    Essentially the same as the LRU (least recently used) paging algorithm

    used by operating systems

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    17/38

    9/24/2014 17

    Eg. - Move_to_front

    The records are accessed in the order of “fileediting”

    a b c d e f g h i j k l m n o p r q s t v w y z

    f a b c d e g h i j k l m n o p r q s t v w y z

    i f a b c d e g h j k l m n o p r q s t v w y z l i f a b c d e g h j k m n o p r q s t v w y z

    e l i f a b c d g h j k m n o p r q s t v w y z

    e l i f a b c d g h j k m n o p r q s t v w y z

    d e l i f a b c g h j k m n o p r q s t v w y z

    i d e l f a b c g h j k m n o p r q s t v w y z t i d e l f a b c g h j k m n o p r q s v w y z

    i t d e l f a b c g h j k m n o p r q s v w y z

    n i t d e l f a b c g h j k m o p r q s v w y z

    g n i t d e l f a b c h j k m o p r q s v w y z

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    18/38

    9/24/2014 18

    Sequential File Organization

    Transpose

    Interchanges the sought record with its immediate predecessor 

    More stable than the Move_to_front algorithm

     A record needs to be accessed many times before it is moved to the

    front of the list

    Easily implemented

    Does not need additional space

    Should be used when space is premium

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    19/38

    9/24/2014 19

    Eg. - Transpose

    The records are accessed in the order of “fileediting”

    a b c d e f g h i j k l m n o p r q s t v w y z

    a b c d f e g h i j k l m n o p r q s t v w y z

    a b c d f e g i h j k l m n o p r q s t v w y z

    a b c d e f g i h j k l m n o p r q s t v w y z

    a b c e d f g i h j k l m n o p r q s t v w y z

    a b c d e f g i h j k l m n o p r q s t v w y z

    a b c d e f i g h j k l m n o p r q s t v w y z

    a b c d e f i g h j k l m n o p r q t s v w y z

    a b c d e i f  g h j k l m n o p r q t s v w y z a b c d e i f  g h j k l n m o p r q t s v w y z

    a b c d e i g f  h j k l n m o p r q t s v w y z

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    20/38

    9/24/2014 20

    Sequential File Organization

    Count

    Keeps count of the number of accesses of each record

    The file is always ordered in a decreasing order of frequency of

    access

    Requires extra sorage to keep the count

    Use it only when the counts are needed for another purpose

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    21/38

    9/24/2014 21

    Direct File Organization

    Ideally, we want to go directly to the address where the record is stored

     A key can be unique address one probe

    More address space than needed

    Eg.1 billion addresses for 300 million people

    Key

    space

    Address

    space

    1 –  1

    999-99-9999 999-99-9999

    0   0

    correspondence

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    22/38

    9/24/2014 22

    Direct File Organization

    Converting information into a unique address

    Eg. : Airline reservation system

    Flight numbers from 1 to 999

    Days are numbered from 1 to 366

    Flight number and day of the year could be concatenated to determine the

    location

    Location = flight number || day of the year, address range 001001-999366

    (???367 - ???999 would not exist)

    Location = day of the year || flight number , address range 001001-366999

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    23/38

    9/24/2014 23

    Direct File Organization

    The key converts to a probable address

    If we remove most of the empty spaces in the address space, we have

    lost the 1-1 correspondence btw keys & addresses

    Hashing functions are used to map the wider range of key values into

    the narrower range of address values

    Hash (key) probable address

    Initial probable address home address

    Hashing function should

    Evenly distribute the keys among the addresses

    Executes efficiently

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    24/38

    9/24/2014 24

    Direct File Organization

     A collision occurs when two distinct keys map to the same

    address

    Hashing is then composed of two aspects;

    The function

    The collision resolution method

    Key

    space

    Address

    space

    999-99-9999

    1200

    0

    0

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    25/38

    Direct File Organization

    Hashing Functions

    9/24/2014 25

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    26/38

    9/24/2014 26

    Direct File OrganizationHashing Functions

    Squaring

    Taking square of a key and then substringing or truncating a portion

    of the result

    Radix conversion

    The key is considered to be in a base other than 10 ans is then

    converted into a number in base 10

    Eg.: Base 11 1234 = 1 * 113 + 2 * 112 + 3 * 111 + 4 * 110= 1331 + 242 + 33 + 4

    = 1610

    substringing or truncation could then be used

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    27/38

    9/24/2014 27

    Direct File OrganizationHashing Functions

    Polynomial hashing

    The key is divided by a polynomial

    f(information area) cyclic check bytes

     Alphabetic keys

     Alphabetic or alphanumeric key values can be input to a hashing

    function if the values are interpreted as integers

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    28/38

    9/24/2014 28

    Direct File OrganizationCollisions

    For a given set of data, one hashing function may distribute the keysmore evenly over the address space than another 

     A hashing function that has a large number of collisions is said to

    exhibit primary clustering

    It is better to have a slightly more expensive hashing function for datathat need to be stored on auxiliary storage

     Another method for reducing collisions is reducing the packing factor 

    number of records stored

    total number of storage locationsPacking factor =

    storage

         c     o       l       l       i     s       i     o     n     s

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    29/38

    9/24/2014 29

    Direct File OrganizationCollision Resolution

    Collision resolution with links

    Collision resolution without links Static positioning of records

    Dynamic positioning of records

    Collision resolution with pseudolinks

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    30/38

    9/24/2014 30

    Direct File Organization

    Collision resolution with links

    If multiple synonyms occur for aparticular home address, we forma chain of synonym records

    Disadvantage extra storage isneeded

    Collision resolution without links

    We can use implied links byapplying a convention , or set ofrules for deciding where to gonext

     A simple convention is to look atthe next location in memory

     Advantage NO extra storage isneeded

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    31/38

    9/24/2014 31

    Direct File OrganizationCoalesced Hashing

    Occurs when we attempt to insert a record with a homeaddress that is already occupied by a record from a chain with

    a different home address

    The two chains with records having different home addresses

    coalesce or grow together 

     X ,D, Y were inserted 

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    32/38

    9/24/2014 32

    Direct File OrganizationCoalesced Hashing

    (Eg.) Hash (key) = key mod 11

    27, 18, 29, 28, 39, 13, 16

    42 & 17 added

     Average # of probes 1.8

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    33/38

    9/24/2014 33

    Direct File OrganizationCoalesced Hashing

    Discussion Packing factor of the final table = 9/11 (82%)

    One method of reducing coalescing is to reduce the packing factor

    It would be advisable to place the most frequently accessed records early inthe insertion process

    Deleting a record is complicated

    If coalescing has occurred,

    a simple deletion procedure is

    to move a record later in the

    probe chain into the position of

    the deleted record

    Final table after deleting 39 ---------->

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    34/38

    9/24/2014 34

    Direct File OrganizationCoalesced Hashing

    Variants Table organization (whether or not a seperate

    overflow area is used)

    The manner of linking a colliding item into a chain

    The manner of choosing unoccupied locations

    Table Organization

    Table primary area + overflow area

     Adres factor = (primary area ) / (total table size)

    Best performance when the adres factor is 0.86

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    35/38

    9/24/2014 35

    Direct File OrganizationCoalesced Hashing

    Variants Late Insertion Standart Colesced Hashing (LISCH)

    New records are inserted at the end ofa probe chain

    Lack of a cellar 

    Late Insertion Coalesced Hashing (LICH)

    Uses a cellar 

    Eg. Keys: 27, 18, 29, 28, 39, 13, 16, 42, 17

    hashing function: key mod 7

     Average # of probes 1.3

    (It was 1.8 for LISCH) In general, for a 90 percent packing factor,

    using a cellar will reduce the number of

    probes by about 6 percent compared

    with LISCH

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    36/38

    9/24/2014 36

    Direct File OrganizationCoalesced Hashing

    Variants

    Early Insertion Standart Colesced Hashing (EISCH) İnserts a new record into a position on the probe chain immediately after the record

    srored at its home address

    İnsertion of the record with key 17 according to EISCH algorithm: Hash (key) = key mod 11

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    37/38

    9/24/2014 37

    Direct File OrganizationCoalesced Hashing

    Variants

    Random Early Insertion Standart Colesced Hashing (REISCH) Choosing a random unoccupied location for the new insertion

    Gives only a 1% improvement over EISCH

    Random Late Insertion Standart Colesced Hashing (RLISCH)

    Bidirectional Late Insertion Standart Colesced Hashing (BLISCH) Choosing the overflow location for a collision insertion by alternating the

    selection between the top and bottom of the table

    Bidirectional Early Insertion Standart Colesced Hashing (BEISCH)

  • 8/20/2019 Estruturas de dados II LISCH EISCH

    38/38

    9/24/2014 38

    Direct File OrganizationCoalesced Hashing

    Comparison