Application: HOTAIR

1. Introduction

As the volume of information published in digital form grows rapidly, users increasingly need to locate documents that satisfy their information need efficiently and accurately.

The aim of the HOTAIR project is to develop a Multi Agent Information Retrieval (IR) System. An IR system identifies documents that satisfy a user's information need, typically expressed as a text query. Documents judged relevant to the query are returned as a list ranked in decreasing order of estimated relevance.

Building the IR system as a Multi Agent system allows it to be:

  1. extensible, in that it facilitates the incorporation of new IR algorithms and supported document types, and
  2. scalable, so that increasingly large document collections can be included by automatically incorporating additional hardware resources into the system.

2. Ongoing Research

To date, a functioning prototype of the HOTAIR architecture has been developed. This consists of two distinct subsystems, which represent the two key components of an IR system. These two subsystems are as follows:

Indexing Subsystem:
The Indexing Subsystem is responsible for the identification and gathering of documents from the World Wide Web, FTP sites, file shares and a variety of other sources. Each document is stored in multiple searchable indices, which are subsequently utilised by various IR algorithms in order to provide a search result in response to users' queries.
Querying Subsystem:
The Querying Subsystem is responsible for accepting queries from users and returning a ranked list of results in response. It does this by making use of a number of IR algorithms and combining their results using probFuse (see "Associated Research" below) to form a single set of results to return to the user.

These are discussed in more detail in the following sections.

2.1 Indexing Subsystem

The Indexing Subsystem is responsible for the identification and gathering of documents that are to be made searchable. In order for a document to be inserted into an index (where it becomes available to the Querying Subsystem), it must undergo three stages of processing: Data Gathering, Translation and Indexing. Each of these three stages is encapsulated by a type of agent, of which many instances exist in the system.

Data Gathering is the process of crawling web sites, FTP sites, shared directories and other data sources and downloading the documents found there. As these documents may be stored in a variety of file formats, the next stage is to translate each of them into a common internal format, known as the HOTAIR Document Format (HDF). HDF is an XML representation of the contents of the original document, along with associated metadata (such as its original location and MIME type). The Translation stage contributes to the extensibility of the system, as support for an additional file format can be added by introducing a single new type of Translator agent.
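The Translation stage can be illustrated with a minimal Python sketch. The HDF schema is not given here, so the element names (`document`, `metadata`, `location`, `mimeType`, `content`), the function names and the registry are all hypothetical; the sketch only shows the idea of one translator per file format producing a common XML representation.

```python
import xml.etree.ElementTree as ET


def translate_plain_text(content: str, location: str, mime_type: str) -> str:
    """Wrap extracted text and its metadata in an HDF-like XML document."""
    root = ET.Element("document")  # hypothetical element names throughout
    metadata = ET.SubElement(root, "metadata")
    ET.SubElement(metadata, "location").text = location
    ET.SubElement(metadata, "mimeType").text = mime_type
    ET.SubElement(root, "content").text = content
    return ET.tostring(root, encoding="unicode")


# One translator per supported MIME type (an assumed design): supporting a
# new file format means registering one more function here.
TRANSLATORS = {"text/plain": translate_plain_text}


def translate(content: str, location: str, mime_type: str) -> str:
    """Pick the translator registered for the document's MIME type."""
    translator = TRANSLATORS.get(mime_type)
    if translator is None:
        raise ValueError(f"no translator registered for {mime_type}")
    return translator(content, location, mime_type)


hdf = translate("Agents & retrieval", "http://example.org/a.txt", "text/plain")
```

Because later stages only ever see the common format, they do not need to change when a new translator is registered.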

Once a document has undergone translation, the final step in the indexing process is for it to be stored in a searchable index. As HOTAIR uses multiple IR algorithms to perform its searches, documents may be added to a number of distinct indices. As each IR algorithm can use different data about each document in order to gauge its relevance to the given query, each index contains data that is specific to the relevant algorithm. Again, separating this stage from the rest of the overall indexing process contributes to the extensibility of the system, as support for further IR algorithms necessitates only the addition of a new type of Indexer agent. Fig 1 illustrates the stages of processing a document must undergo in order to be included in an index.

Fig 1. The stages a document must go through in order to be included in the index.
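The idea of algorithm-specific indices can be sketched as follows. The two indexer classes, their data structures and the whitespace tokenisation are illustrative assumptions, not the Indexer agents used in HOTAIR; the point is only that the same translated document is stored differently depending on what each IR algorithm needs.

```python
from collections import Counter, defaultdict


class TermFrequencyIndexer:
    """Keeps per-document term counts, as a ranking algorithm might need."""

    def __init__(self):
        self.postings = defaultdict(dict)  # term -> {doc_id: count}

    def add(self, doc_id, text):
        for term, count in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = count


class BooleanIndexer:
    """Keeps only term presence, which is enough for Boolean matching."""

    def __init__(self):
        self.postings = defaultdict(set)  # term -> {doc_id, ...}

    def add(self, doc_id, text):
        for term in set(text.lower().split()):
            self.postings[term].add(doc_id)


def index_document(doc_id, text, indexers):
    """Pass one translated document to every registered indexer."""
    for indexer in indexers:
        indexer.add(doc_id, text)


tf_index, bool_index = TermFrequencyIndexer(), BooleanIndexer()
index_document("d1", "agent systems and agent platforms", [tf_index, bool_index])
index_document("d2", "retrieval systems", [tf_index, bool_index])
```

Supporting a further algorithm in this sketch would mean adding one more class with an `add` method and including it in the list passed to `index_document`.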

2.2 Querying Subsystem

The Querying Subsystem consists of two types of agent. Query Dispatcher agents accept queries from users and are responsible for returning the final set of results in response. To do this, the Query Dispatcher forwards the query to a number of Query Handler agents, each of which performs a search using a different IR algorithm. Once these results have been returned to the Query Dispatcher, they are combined using the probFuse data fusion algorithm (see "Associated Research" below for further details) into a single result set that is then returned to the user.
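The dispatcher and handler roles can be sketched in a few lines of Python. This is an assumption-laden illustration: the handlers are plain functions rather than agents, the thread pool stands in for inter-agent messaging, and the fusion step is a simple reciprocal-rank sum used as a stand-in, not probFuse.

```python
from concurrent.futures import ThreadPoolExecutor


def dispatch(query, handlers, fuse):
    """Send the query to every handler in parallel and fuse the ranked lists."""
    with ThreadPoolExecutor(max_workers=len(handlers)) as pool:
        result_lists = list(pool.map(lambda handler: handler(query), handlers))
    return fuse(result_lists)


def reciprocal_rank_fuse(result_lists):
    """Stand-in fusion (not probFuse): sum 1/rank over all input lists."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / rank
    # Highest score first; ties are broken by document id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Two toy handlers returning fixed rankings, standing in for IR algorithms.
def handler_a(query):
    return ["d1", "d2", "d3"]


def handler_b(query):
    return ["d2", "d1", "d4"]


fused = dispatch("multi agent retrieval", [handler_a, handler_b], reciprocal_rank_fuse)
```

Passing the fusion function as a parameter keeps the dispatcher independent of the fusion algorithm in use.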

2.3 Management Agents

The number of agents of each type outlined above is allowed to vary in accordance with system demand. Performance management agents are responsible for ensuring that scarce computing resources are utilised in the most efficient manner. They can do this by taking such actions as creating new agents, terminating agents to free up resources, halting groups of agents that are less in demand and controlling the agent platforms on which certain processing tasks are carried out.
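A demand-driven scaling decision of this kind might look like the sketch below. The policy, the thresholds (`high`, `low`, `min_agents`) and the stage names are hypothetical; the source does not specify how the performance management agents decide.

```python
def plan_scaling(backlog, agents, high=100, low=10, min_agents=1):
    """Return the change in agent count for one processing stage.

    backlog: jobs waiting for this stage; agents: agents currently running.
    high and low are hypothetical per-agent backlog thresholds.
    """
    per_agent = backlog / max(agents, 1)
    if per_agent > high:
        return 1   # stage is falling behind: create an agent
    if per_agent < low and agents > min_agents:
        return -1  # stage is mostly idle: terminate an agent to free resources
    return 0       # load is within bounds: leave the stage alone


# (backlog, running agents) for each stage; the numbers are made up.
stages = {"gatherer": (50, 2), "translator": (900, 3), "indexer": (5, 4)}
decisions = {name: plan_scaling(backlog, agents)
             for name, (backlog, agents) in stages.items()}
```

Here the translator stage would gain an agent and the indexer stage would lose one, while the gatherer stage is left unchanged.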

Additionally, other management agents are responsible for monitoring the overall health of the system by identifying failed agents and agent platforms and ensuring that the system is still capable of functioning when such a failure occurs.
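One common way to detect failed agents is a heartbeat timeout, sketched below. The heartbeat mechanism, the 30-second timeout, the agent names and the round-robin reassignment are all assumptions made for illustration; the source does not describe how HOTAIR detects or recovers from failures.

```python
def find_failed(last_heartbeat, now, timeout=30.0):
    """Return agents whose most recent heartbeat is older than `timeout` seconds."""
    return sorted(name for name, seen in last_heartbeat.items()
                  if now - seen > timeout)


def reassign_jobs(assignments, failed, survivors):
    """Move jobs held by failed agents to surviving agents, round-robin."""
    if not survivors:
        raise RuntimeError("no surviving agents to take over jobs")
    orphaned = [job for agent in failed for job in assignments.pop(agent, [])]
    for i, job in enumerate(orphaned):
        assignments.setdefault(survivors[i % len(survivors)], []).append(job)
    return assignments


# Timestamps are seconds on an arbitrary clock; indexer-2 has gone quiet.
heartbeats = {"indexer-1": 100.0, "indexer-2": 62.0, "indexer-3": 98.0}
failed = find_failed(heartbeats, now=100.0)
jobs = {"indexer-1": ["a"], "indexer-2": ["b", "c"], "indexer-3": []}
survivors = [name for name in sorted(heartbeats) if name not in failed]
jobs = reassign_jobs(jobs, failed, survivors)
```

After reassignment, no job remains with the failed agent, so processing can continue on the surviving agents.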

Fig 2 and Fig 3 are sample screenshots of the visualisation tool developed to monitor the functioning of the system.

Fig 2. Visualisation tool demonstrating the three stages of processing to be undergone by each document. The blue lines illustrate which agents are in contact with one another in order to source jobs for processing.

Fig 3. Visualisation tool demonstrating system performance. Each line graph shows the number of documents processed at each stage that have yet to be requested and processed by the subsequent stage, plotted against time.

3. Associated Research

In addition to work on Multi Agent Systems, the HOTAIR project has also been responsible for the development of the probFuse algorithm, a "data fusion" algorithm aimed at combining the outputs of multiple IR algorithms in order to achieve superior results. The Indexing Subsystem maintains multiple indices which are usable by several IR algorithms. ProbFuse weights the results output by each algorithm based on the quality of that algorithm's past performance.
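Weighting by past performance can be sketched as follows. This is a simplified illustration rather than the definitive algorithm: splitting each result list into segments, estimating a probability of relevance per segment from training queries, and dividing by the segment number to favour early ranks are assumptions made here, as are the function names and tie-breaking. The publications listed below give the authoritative definition.

```python
from collections import defaultdict


def segment_of(rank, list_length, num_segments):
    """Map a 0-based rank to a 1-based segment number."""
    return min(num_segments, rank * num_segments // list_length + 1)


def train(training_runs, num_segments):
    """Estimate, per segment, the fraction of returned documents that were relevant.

    training_runs: list of (ranked_doc_ids, set_of_relevant_doc_ids) pairs
    produced by ONE algorithm on past queries.
    """
    per_query = defaultdict(list)  # segment -> relevant fraction for each query
    for ranked, relevant in training_runs:
        hits, sizes = defaultdict(int), defaultdict(int)
        for rank, doc_id in enumerate(ranked):
            k = segment_of(rank, len(ranked), num_segments)
            sizes[k] += 1
            hits[k] += doc_id in relevant
        for k in sizes:
            per_query[k].append(hits[k] / sizes[k])
    return {k: sum(v) / len(v) for k, v in per_query.items()}


def fuse(result_lists, probabilities, num_segments):
    """Score each document by summing P(relevant | segment) / segment number."""
    scores = defaultdict(float)
    for ranked, probs in zip(result_lists, probabilities):
        for rank, doc_id in enumerate(ranked):
            k = segment_of(rank, len(ranked), num_segments)
            scores[doc_id] += probs.get(k, 0.0) / k
    return sorted(scores, key=lambda d: (-scores[d], d))


# Toy training data: algorithm A was accurate at the top of its list,
# algorithm B was equally (un)reliable throughout.
probs_a = train([(["a", "b", "c", "d"], {"a", "b"})], num_segments=2)
probs_b = train([(["a", "b", "c", "d"], {"a", "d"})], num_segments=2)
fused = fuse([["x", "y", "z", "w"], ["z", "w", "x", "y"]],
             [probs_a, probs_b], num_segments=2)
```

In this toy run, the documents ranked highly by the historically more accurate algorithm A end up at the top of the fused list.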

4. Future Work

Future work on HOTAIR is intended to be largely in the domain of Autonomic Computing. This refers to the strand of research in which large-scale computer systems are designed to be capable of self-management. The aim of this initiative is to greatly reduce the amount of human interaction necessary to achieve optimal system performance. Key features of an Autonomic Computing system include self-awareness, self-configuration, self-optimisation, self-healing and self-protection.

In particular, the area of self-optimisation is of key interest. The foremost goal is to instil in the system the ability to exploit and balance the utilisation of scarce computing resources such as processing power, memory and storage as demand dictates. This will involve managing the output levels of each stage of the production line-style Indexing Subsystem, in order to ensure that new documents and updated versions of existing documents are included in the index as efficiently and as speedily as possible. This must be balanced against the users' need for accurate search results to be returned to them almost instantaneously. Overemphasis on either subsystem would lead to the system failing to achieve its goals, either by driving away users with slow response times, or by failing to maintain a sufficiently large document collection for relevant information to be available for users' queries.

5. Selected Publications

  • D. Lillis, R. Collier, F. Toolan, and J. Dunnion. Evaluating communication strategies in a multi agent information retrieval system. In Proceedings of the 5th European Workshop on Multi-Agent Systems (EUMAS'07), Hammamet, Tunisia, 2007.
  • D. Lillis, F. Toolan, A. Mur, L. Peng, R. Collier, and J. Dunnion. Probabilistic data fusion on a large document collection. Artificial Intelligence Review, 2007.
  • D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval, pages 139-146, New York, USA, 2006. ACM Press.
  • D. Lillis, F. Toolan, A. Mur, L. Peng, R. Collier, and J. Dunnion. Probability-based fusion of information retrieval result sets. Artificial Intelligence Review, 25(1-2), 2006.
  • L. Peng, R. Collier, A. Mur, D. Lillis, F. Toolan, and J. Dunnion. A self-configuring agent-based document indexing system. In Proceedings of the 4th International Central and Eastern European Conference on Multi-Agent Systems (CEEMAS 2005), Budapest, Hungary, 2005. Springer-Verlag GmbH.