[ieee comput. soc third international conference on computational intelligence and multimedia...


Page 1: [IEEE Comput. Soc Third International Conference on Computational Intelligence and Multimedia Applications. ICCIMA'99 - New Delhi, India (23-26 Sept. 1999)] Proceedings Third International

Schema-Independent Mediators using Content Semantics to Access Data Resources

Mário L. O. Flecha
Fundação João Pinheiro
Departamento de Ciência da Computação, Universidade Federal de Minas Gerais
Prodemge
[email protected]

José Luis Braga
Departamento de Informática
Universidade Federal de Viçosa
Belo Horizonte, MG, Brazil
[email protected]

Abstract

This paper reports a three-layer architectural solution for co-operative database access that has been tested and evaluated in a real-use environment. It seeks to make mediator software independent from the idiosyncrasies of user application interfaces and of specific database schemata. The cornerstone of the implementation is a set of relational databases and views manipulated by a Relational Inference Machine (RIM). Conceptually, it is supported by a semantic model implemented as a modelling technique we have named Content Semantics (CS), which allows a pragmatic concept ontology to be expressed as contextualized terms in well-defined semantic domains. Several public databases are accessed daily by mediation services working over millions of data records stored in large databases.

Keywords: mediator, database, semantic processing, expert systems, information retrieval, semantic knowledge-based system, co-operative information system.

1. Introduction

An important goal of mediators is to give semantic support to information searching over heterogeneous and mixed legacy data in a medley of federated databases. A general mediator that covers all topics of interest to any application is highly unwise. Generic, modular components are a better strategy for building flexible configurations from sets of mediators, each one focused on its own scope and service [16]. Mediators may interact with one another to answer data retrieval questions that involve data resources stored in older technologies [16; 18; 17]. A co-operative interface for database access is helpful to user applications, especially to ease consultations, electronic commerce and other services over the infoways.

Conceptual support comes from the research field of semantic information processing. We have taken as a basis the work of [10] on a representative computational model of human associative memory and its use to create or obtain meaning from context. Pursuing a generic mediation architecture, independent from specific database schemata and user application peculiarities, we have developed a technique named Content Semantics (CS) [4; 5]. It is intended to be a concise method to contextualize, extract and produce semantics, either inferred from byte contents at data entry time or directly inserted by human specialists into a domain-specific ontology, for later use in a variety of information retrieval purposes.

As defined in Semantic Theory [13], the lightweight organization of the lexicon might be the lever to facilitate and speed up the automatic treatment of meaning for co-operative data access mediation, including poorly structured data like HTML or text files. Our implementation uses lexicon-based mechanisms for a pragmatic and contextual approach to disambiguation, polysemy, synonymy and homonymy [13]. These mechanisms exploit lexical structure and its halfway position between systematized and non-systematized organization [13]. The lexical level holds the more atomic and indivisible elements, which are the semantic cornerstones of language and aid co-operative information retrieval. Natural Language Processing, by contrast, has to face syntactic structure, which is intrinsically heavier than lexical structure: the syntactic level implies order, function and all the internal complexities and rules of natural language. Our work is not concerned with a mediator with such skill. We have striven for a mediator component able to offer semantic services and to co-operate with users in accessing data resources.

The lexical approach nevertheless touches underlying questions related to ontology, semantic knowledge maintenance and re-usability over data resources in the same semantic domain, including terms from different languages [17; 18]. Terms may provide some meaning by themselves (autosemantic terms), but context is the mechanism that fixes the meaning less ambiguously [13]. For structured databases the context is in general known beforehand from the semantic domain of the datafield contents. It may also be derived from the processes of an application, which may create additional terms that are external to the datafield contents but completely related to the context of the database and its tuples. The sets of terms and contexts are ontocomponents of one or more ontologies, which may benefit from a given application but do not belong to any specific one. Beyond this automatic acquisition there is non-automatic maintenance by human specialists in a domain, which completes the process of ontology maintenance. The knowledge base is the point of contact between the mediator and a specific data resource: the ids of ontology terms are related to the ids of instances in one or more data resources. Knowledge base maintenance is similar to ontology maintenance, except that it is particular to an application.

Under the CS model, an ontology implements pragmatic concepts as contextualized lexical terms, whose relationships with the instances of fact databases compose a semantic knowledge base. RIM was implemented to use a CS-based knowledge base, exploring the terms supplied by users in their consultations. When searching for information, a mediator based on the CS model and RIM shall "understand" the contents of datafield bytes with semantics derived mainly from the domain-specific lexicon used in natural languages. In doing so it relieves users from the burdensome duty of learning a query language like SQL, even a simplified one [12; 17]. It reduces time-consuming dialogue sessions, minimizes useless bytes and does not require the user or the mediator itself to know details of specific database schemata, something impossible over the Internet.


2. An architecture for independence

In a three-layer architecture, a mediator can be an independent middle component only if it is not tied to application specifics (upper layer) or to DBMS protocols (lower layer) [17; 18; 16]. To attain this goal, the user application in the upper layer interfaces externally with users and internally with the mediation layer interface. In this way a mediator becomes widely reusable and useful to upper-layer clients [5], while its downward interface separates it from the data resource lower layer yet provides a clear way to request services from underneath [5] (see figure 1).

Figure 1. Basic mediation architecture

SQL alone does not give users and user applications all the flexibility and cognition needed to interface co-operatively and independently with the data resource layer. Conventional indexing is not enough [17; 18; 16]. CS and RIM provide a way to implement a more powerful indirect indexing [4; 5].

[Figure 1 diagram, labels recovered from the original: User (a person or a program); User's Application; Mediator, with an Upward interface (Knowledge Acquisition, Consultation, Knowledge Acquisition Methods) and a Downward interface; Question (set of questions) going in and Answer (set of tuple ids) coming out; RIM; Relations and composite views of RIM; RIM's auxiliary objects; Contextualized Term (Ontology), holding term 1, term 2, term 3, term 10, term 30, term 100, term 1000, ..., term K; Facts database instance x Contextualized Term (Semantic Knowledge Base), holding rows such as Database X Instance 1 Term 3, Database X Instance 1 Term 2, Database X Instance 2 Term 90, Database X Instance 1000 Term 10, Database Y Instance 5 Term 100, Database Y Instance 3 Term 100, Database L Instance 2 Term 2000, Database Z Instance Z Term K; Facts Databases X, Y, Z, each with Instance 1, Instance 2, ..., Instance N.]

[Figure 2 diagram, content recovered from the original: Ontology Domain. ContextPrefix * Term = ContextualizedTerm. (a) Example: city*Seattle; state*WS; year*1998. The context database, beyond the prefix, keeps processing information for term treatment, like phonetization, word breaking, etc.]


Figure 2. CS ontology's ontocomponents
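The rule in Figure 2 (ContextPrefix * Term = ContextualizedTerm) can be illustrated with a minimal Python sketch. The paper does not give the actual term treatment, so the `normalize` step below (accent stripping, case folding, whitespace collapsing) and the function names `contextualize` and `parse` are assumptions made for illustration only.

```python
import unicodedata
from typing import Tuple

SEPARATOR = "*"  # separator shown in Figure 2: ContextPrefix * Term


def normalize(term: str) -> str:
    """Hypothetical term treatment: strip accents, fold case, collapse spaces."""
    decomposed = unicodedata.normalize("NFKD", term)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(plain.casefold().split())


def contextualize(prefix: str, term: str) -> str:
    """Build a contextualized term from a context prefix and a raw term."""
    return f"{prefix}{SEPARATOR}{normalize(term)}"


def parse(contextualized: str) -> Tuple[str, str]:
    """Split a contextualized term back into (context prefix, term)."""
    prefix, _, term = contextualized.partition(SEPARATOR)
    return prefix, term


if __name__ == "__main__":
    # The example values come from Figure 2.
    for prefix, term in [("city", "Seattle"), ("state", "WS"), ("year", "1998")]:
        print(contextualize(prefix, term))
```

Because the prefix is part of the key, the same word in two contexts (for instance a city and a street both named "Rio") yields two distinct contextualized terms, which is the disambiguation role the text assigns to context.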

Contextualized terms may belong to one or more ontologies. They are created and captured early by semantic knowledge acquisition methods at the mediator's upward interface. In turn, RIM offers a co-operative way to implement user consultations, not merely responding reactively but properly answering with meaningful information [14]. RIM has a conceptual and logical schema that can be implemented in different ways. One architectural issue to be coped with on the way to schema-independent mediators is layer openness, which brings a high degree of coupling between layers [11; 3; 17]. Open layer-based architectures generate high coupling and compromise reusability and modularity [11; 3; 17]. Closed layer architectures can better achieve schema independence [11; 3; 17].

CS and RIM increase a mediator's fan-in because it connects easily to, and retrieves from, data resources whose universe of discourse belongs to the same semantic domain, regardless of their underlying implementation. There is indeed a database environment where RIM resides and executes, but it is independent of the execution environments of the user applications and of the data resource layer. An individual makes personal use of a set of terms to compose his speech habits [13]. This is the repertory that each person uses in situations where linguistic events happen, including database consultations and man-machine dialogues. Hence, an up-to-date dictionary of basic terms shall be kept as close as possible to completion, together with the context and contextualized term databases, which are ontocomponents based on common-sense pragmatic concepts expressed with terms (not exclusively words but also abbreviations, acronyms, expressions, compound words, etc.).

3. Evaluations

There are mediator implementations specialized in bibliographical search, in police criminal search and identification, and in search for the names of State institutions and departments, among others. There are use cases where information is merged from different DBMS models; as said before, there are no restrictions in this matter. For example, the State Police Information System has three fact databases, which keep data about eleven million citizens, with a semantic knowledge base of sixty million tuples and an ontology of twenty million terms. Another system keeps information about four hundred thousand State public servants, with a semantic knowledge base of forty million tuples and an ontology of ten million terms.

A comparison between two systems was set up to evaluate and measure the degree of co-operation with users consulting a database (tables 1 and 2). One of them uses a mediator component and the other does not. In this simple use case we have focused on measuring the semantic power of our mediator. It was not our concern to show the complete set of facilities, which includes searching multiple heterogeneous data resources. Both systems compared are specialized in localities from all over Brazil and have the same original data source: the Post Office. Data covers states, cities, towns, quarters and all kinds of streets. Some mediators have bigger knowledge bases and ontologies than the location mediator, but it is the one with the highest fan-in and concurrent use in the evaluated environment. The non-mediated system, named CepDigital, is a software product for personal use that runs on stand-alone equipment. It offers a graphical interface and uses string-matching techniques for postal data retrieval. The mediated system works in an interactive, concurrent, multi-user environment. It is supported by a mainframe that hosts the localities fact database and the mediator's database components, which are intensively called by corporate user applications. Locality information found by the mediator is used in a variety of ways by user applications (e.g., to display postal information, to validate data entry, or to standardize a locality or convert it into a sequential location number that hides or represents datafields like birthplace, residential address, commercial address and so on). The compared systems execute in quite distinct scenarios; therefore the comparison is limited to the same data collection, using the same consultation terms in different environments and media.

It was not our goal to answer questions related to external interface characteristics, hardware performance, RDBMS performance or environmental issues. Yet RIM technical issues related to selectivity in large knowledge bases (on the order of tens of millions of tuples) required performance solutions. Solving these restrictions produced a drastic reduction in response time. The upward interface of the evaluated mediator was implemented in a 4GL under a non-object-oriented paradigm. The downward interface uses solely SQL over a set of relations and views, which inter-operate and are started from the upward interface. This mediator is used by several corporate systems at the level of the State or of the State Secretariats of Minas Gerais, Brazil. Some of them are used intensively 24 hours a day, seven days a week, under concurrency. The absence of the OO paradigm has produced some undesirable effects on programming code and on ontology and knowledge maintenance. An OO version is under specification.
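To make the idea of a downward interface that "uses solely SQL over a set of relations and views" more concrete, the following Python sketch uses the standard `sqlite3` module. It is not RIM's schema, which the paper does not publish: the table names `ontology` and `knowledge_base`, the view `term_instances`, the function `consult` and the sample rows are all hypothetical. It only shows how term ids can be related to instance ids of fact databases, and how a consultation can return a set of tuple ids without knowing the schema of the fact databases themselves.

```python
import sqlite3


def build_demo(conn: sqlite3.Connection) -> None:
    """Create a toy ontology, knowledge base and view (all names are assumed)."""
    conn.executescript("""
        CREATE TABLE ontology (
            term_id INTEGER PRIMARY KEY,
            contextualized_term TEXT UNIQUE
        );
        CREATE TABLE knowledge_base (
            db_name TEXT,
            instance_id INTEGER,
            term_id INTEGER REFERENCES ontology(term_id)
        );
        CREATE VIEW term_instances AS
            SELECT o.contextualized_term, k.db_name, k.instance_id
            FROM ontology o JOIN knowledge_base k ON k.term_id = o.term_id;
    """)
    conn.executemany("INSERT INTO ontology VALUES (?, ?)", [
        (1, "city*belo horizonte"), (2, "state*mg"), (3, "city*rio de janeiro"),
    ])
    conn.executemany("INSERT INTO knowledge_base VALUES (?, ?, ?)", [
        ("X", 1, 1), ("X", 1, 2), ("X", 2, 3), ("Y", 5, 1), ("Y", 5, 2),
    ])


def consult(conn: sqlite3.Connection, terms):
    """Return (db_name, instance_id) pairs that are linked to every given term."""
    terms = list(dict.fromkeys(terms))  # drop duplicates, keep order
    if not terms:
        return []
    placeholders = ", ".join("?" for _ in terms)
    sql = f"""
        SELECT db_name, instance_id
        FROM term_instances
        WHERE contextualized_term IN ({placeholders})
        GROUP BY db_name, instance_id
        HAVING COUNT(DISTINCT contextualized_term) = ?
        ORDER BY db_name, instance_id
    """
    return conn.execute(sql, (*terms, len(terms))).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    build_demo(connection)
    print(consult(connection, ["city*belo horizonte", "state*mg"]))
```

The answer is a list of instance identifiers spread over fact databases X and Y; the caller needs only the terms, which is the kind of schema abstraction the text describes.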

3.1. City Search

Table 1 presents the data produced during the tests. Three city names were searched using different spellings: Belo Horizonte, São Thomé das Letras and Rio de Janeiro. Besides the city name, the State initials may or may not be supplied. The table shows the results obtained with each software. "Y" stands for Yes, "N" stands for No, "BL" means a Big List of items found for the user to choose from (more than forty items), and "SL" means a Small List of items found for the user to choose from (fewer than forty items).

Table 1: City name consultations and results returned from each application

                                  Did CepDigital find?         Did the Mediator find?
City name                         Without State  With State    Without State  With State
BH                                N              N             Y              Y
BHZ                               N              N             Y              Y
B Horizonte                       N              N             Y              Y
B Orizonte                        N              N             Y              Y
Belo Horizonte                    Y              Y             Y              Y
Belo                              N              N             N              N
São Thomé das Letras              N              N             Y              Y
São Tomé das Letras               Y              Y             Y              Y
São Tomé Letras                   N              N             Y              Y
São Tomé                          BL             N             SL             SL
Rio                               N              N             Y              Y
R de Janeiro                      N              N             Y              Y
R Janeiro                         N              N             Y              Y
R J                               N              N             Y              Y
Rio de Janeiro                    Y              Y             Y              Y
São Sebastião do Rio de Janeiro   N              N             Y              Y
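The result codes used in the tables can be read as a function of how many items a consultation returns. The Python sketch below encodes that reading. Two points are assumptions, since the text does not settle them: that "Y" means exactly one item was found, and that a list of exactly forty items counts as a Small List. The name `result_code` is hypothetical.

```python
BIG_LIST_THRESHOLD = 40  # "BL" is described as more than forty items


def result_code(match_count: int) -> str:
    """Map the number of items found to the codes N, Y, SL or BL."""
    if match_count < 0:
        raise ValueError("match_count cannot be negative")
    if match_count == 0:
        return "N"   # nothing found
    if match_count == 1:
        return "Y"   # assumption: "Yes" means a single, unambiguous item
    if match_count > BIG_LIST_THRESHOLD:
        return "BL"  # big list for the user to choose from
    return "SL"      # small list (assumption: exactly forty also lands here)


if __name__ == "__main__":
    for count in (0, 1, 12, 250):
        print(count, result_code(count))
```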


3.2. Location Search

Table 2 shows test results for one street searched using different spellings of its official name: Professor Aníbal de Matos. Besides the street's name, the State initials, city name, kind of street and quarter may or may not be supplied.

Table 2: Location name consultations and results returned from each software under test

State  City            Kind of  Street's name              Quarter's name  Did CepDigital  Did Mediator
                       street                                              find?           find?
                                Aníbal Matos*              São Pedro**     N               SL
                       St       Aníbal Matos               São Pedro       N               SL
                       Street   Aníbal Matos               São Pedro       N               SL
                       Street   Professor Aníbal Matos     São Pedro       N               Y
                       Street   Professor Aníbal de Matos  São Pedro       N               Y
                       Avenue   Prof.Aníbal Matos          São Pedro       N               Y
MG     Belo Horizonte  Street   Professor Aníbal de Matos  Santo Antônio   N               Y
MG     Belo Horizonte  Street   Prof Anïbal de Matos       Santo Antônio   N               Y
MG     Belo Horizonte  St       Professor Aníbal de Matos  Santo Antônio   Y               Y
MG     Belo Horizonte  Street   Professor Aníbal de Matos  or S Antônio    N N             BL BL
MG     Belo Horizonte  Street   Professor                  or S Antônio    BL BL           BL SL
MG     Belo Horizonte  Street   Anïbal                     or S Antônio    N               SL
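The footnote to Table 2 states that variants such as Haníbal Mattos, Aníbal de Mattos and Mattos Aníbal lead to the same answer as the official name. The paper does not disclose the phonetization or word-breaking rules behind this, so the following Python sketch is only one plausible way to obtain that behaviour. Every rule in it is an assumption: accents are stripped, a leading "h" is dropped, doubled letters are collapsed, the particles in `STOPWORDS` are ignored, "prof" is expanded, and word order is disregarded.

```python
import re
import unicodedata

STOPWORDS = {"de", "da", "do", "das", "dos"}  # assumed list of ignored particles
ABBREVIATIONS = {"prof": "professor"}         # assumed abbreviation dictionary


def phonetic_key(token: str) -> str:
    """Reduce one word to a rough, hypothetical phonetic key."""
    decomposed = unicodedata.normalize("NFKD", token)
    plain = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    plain = re.sub(r"[^a-z0-9]", "", plain.casefold())
    plain = ABBREVIATIONS.get(plain, plain)
    if plain.startswith("h"):
        plain = plain[1:]                       # treat a leading "h" as silent
    return re.sub(r"(.)\1+", r"\1", plain)      # collapse doubled letters


def term_keys(text: str) -> frozenset:
    """Break a name into words and return the set of their phonetic keys."""
    tokens = re.split(r"[\s.]+", text.strip())
    keys = {phonetic_key(token) for token in tokens if token}
    return frozenset(key for key in keys if key and key not in STOPWORDS)


def matches(query: str, official_name: str) -> bool:
    """True when every word of the query is found in the official name."""
    query_keys = term_keys(query)
    return bool(query_keys) and query_keys <= term_keys(official_name)


if __name__ == "__main__":
    official = "Professor Aníbal de Matos"
    for variant in ("Haníbal Mattos", "Aníbal de Mattos", "Mattos Aníbal"):
        print(variant, matches(variant, official))
```

Subset matching is used because incomplete names are accepted in the tests; a real implementation would then rank or list the candidates, which is where the small and big lists of Table 2 would come from.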

4. Conclusion

Context treatment, synonyms, term range consultations, multiple databases and users, multiple and simultaneous queries per user, approximate answers and schema abstraction are some of the aspects already implemented, tested and in use. Several solutions for caching, selectivity, concurrency, bottlenecks and combinatorial explosion were added in the real scenario described in section 3 ([5] shows details). The location mediator achieved a high fan-in, but because it was implemented with non-object-oriented programming, its degree of encapsulation and re-use can still be improved. This is the point where our research is being conducted. Improved co-operation might occur among the neighbouring research areas of data warehousing, agents and mediation. Standards like CORBA may provide a more general basis for that. In the future, agent-mediators might be the integrating element over the networks, increasing the density and relevance of information through easy interfaces able to hide complexity and to sustain the decision-making process in the face of the speed of change.

* The mediator's answer would not change if the street name or its nickname had been written in similar but not identical ways (including incomplete ones), for example: Haníbal Mattos, Aníbal de Mattos, Mattos Aníbal etc. The official name is Professor Aníbal de Matos. The kind of street may vary and be confused: St., Street, Av., Avenue, Square etc.
** The street supplied is actually located in the Santo Antônio quarter.


References

[1] Vannevar Bush. As We May Think. The Atlantic Monthly, July 1945, p.13. Ottawa, Canada: University of Ottawa. Denys Duchier (ed). HTML version available at <www.isg.sfu.ca/~duchier/misc/vbush/vbush.all.shtml>.

[2] Eduardo T. Damasceno. RIM Utilization Report. Belo Horizonte, MG: Prodemge, DTP/STP/GPP, July 30th, 1997. (Performance Report, in Portuguese).

[3] Gerhard Fischer. Domain-Oriented Design Environments. Automated Software Engineering, p.177-203, 1994.

[4] Mário L.O. Flecha. Semantic Information Processing. Building a Relational Inference Machine - RIM - for Cognitive Systems' Development. Belo Horizonte, MG: Prodemge, 1994, p.68. (Technical Report No. 25, in Portuguese).

[5] Mário L.O. Flecha. Semantic Processing on the Mediation of Public Database Querying. MSc. Thesis. Belo Horizonte: Fundação João Pinheiro - Escola de Governo / Departamento de Ciência da Computação, Universidade Federal de Minas Gerais, 1997. 223p. (in Portuguese).

[6] Michael Genesereth et al. Reference Architecture for the Intelligent Integration of Information Retrieval. Prepared by the Program on Intelligent Integration of Information (I3). Version 2.0 (draft). DARPA - Defense Advanced Research Projects Agency, August 2, 1995. 90p.

[7] Michael Genesereth, Richard E. Fikes et al. KIF - Knowledge Interchange Format. Version 3.0 Reference Manual. Stanford, CA: Computer Science Department, Stanford University, June 1992. 68p. (Report Logic-92-1).

[8] Peter E. Hart, Jamey Graham. Query-free Information Retrieval. In: Co-operative Information Systems. IEEE Expert, p.32-37, Sep/Oct, 1997.

[9] John Mylopoulos, Michael Papazoglou. Co-operative Information Systems. In: Guest Editors' Introduction. IEEE Expert, p.28-37, Sep/Oct, 1997.

[10] Ross Quillian. Semantic Memory. In: Marvin Minsky (ed). Semantic Information Processing. Cambridge, Mass.: The MIT Press, 1968, p.227-270.

[11] J. Rumbaugh et al. Modelagem e Projetos Baseados em Objetos. Trad. Dalton Conde de Alencar. Rio de Janeiro: Campus, 1994. 654p. (Translation of: Object-oriented modeling and design. Englewood Cliffs, New Jersey: Prentice-Hall, 1991).

[12] Ben Shneiderman, University of Maryland. Designing the User Interface - Strategies for Effective Human-Computer Interaction. 2nd ed. Massachusetts: Addison-Wesley, 1993. 573p.

[13] Stephen Ullmann. Semântica - Uma Introdução à Ciência do Significado. 4th ed. (Portuguese). Trad. J. A. Osório Mateus. Lisbon, Portugal: Fundação Calouste Gulbenkian, 1977. 578p. (Translation of: Semantics - An Introduction to the Science of Meaning. Oxford: Basil Blackwell, 1964).

[14] Bonnie L. Webber. Questions, Answers and Responses: Interacting with Knowledge Base Systems. In: Michael L. Brodie, John Mylopoulos (ed). On Knowledge Base Management Systems. Integrating Artificial Intelligence and Database Technologies - Topics in Information Systems. Springer-Verlag, 1986, p.365-402.

[15] Gio Wiederhold, Michael Genesereth. Basis for Mediation. In: Proc. International Conference on Co-operative Information Systems (CoopIS95). Vienna, Austria, May 1995, p.138-155. Available at <[email protected]>.

[16] Gio Wiederhold, Michael Genesereth. The Conceptual Basis for Mediation Services. In: Co-operative Information Systems. IEEE Expert, p.38-47, Sep/Oct, 1997.

[17] Gio Wiederhold. Mediators in the Architecture of Future Information Systems. Computer, p.38-49, 1992.

[18] Gio Wiederhold. Foreword and Glossary: Intelligent Integration of Information. In: Gio Wiederhold (ed). Intelligent Integration of Information. Norwell, Mass.: Kluwer Academic Publishers, p.5-9, p.193-201, May/June, 1996. (Journal of Intelligent Information Systems, Integrating Artificial Intelligence and Database Technologies, 6(2/3)).