PhDing ;-)

Wednesday, July 26, 2006

DB Query Research vs. IR Research

Sorry, this really turned out to be a "novel" instead of a summary.

What surprised me a little at the beginning is that IR researchers seem not to care too much about performance. Since performance is the most important type of benchmark against other approaches in the DB world, I expected this to be true in IR as well. The explanation for why it isn't goes like this:

In DB research you have a query and a (sometimes semi-) structured dataset. You run queries on the data, limiting the result to the data items that satisfy the constraints given in the query. Given a specific dataset and query, the outcome must always be deterministic, and it can't be questioned by anybody. That means that once you are sure that the query you developed is semantically correct, it returns the exact answer to your query (question) from the underlying database. In that sense it is like the question answering discipline, which extracts a few possible answers to a (natural language) question from a given source of documents (information base). Of course, the difference is that question answering only guesses an answer, which could be far from correct or exact. The opinion about whether the answer is correct is also very subjective.

In IR systems the user formulates a query as well. The system returns a set of information sources (documents) that are likely to contain the information the user searched for in the query. The set of sources is most often ranked by the system's estimated likelihood that each source is relevant. It therefore forms a list of results with the most relevant sources at the beginning. The problem with this presentation of information sources is that the user might find other documents more relevant, or would have ranked the relevant documents differently.
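To make the contrast a bit more concrete, here is a minimal sketch in Python. The toy collection, the function names `db_select` and `ir_rank`, and the naive term-frequency score are all my own assumptions for illustration; real systems use far better relevance estimates.

```python
from collections import Counter

# Hypothetical toy collection; ids and texts are made up for illustration.
DOCS = {
    "d1": "xml query processing in database systems",
    "d2": "ranking video documents by relevance",
    "d3": "video query ranking for video retrieval",
}

def db_select(docs, term):
    """DB-style selection: a document either satisfies the constraint or not."""
    return sorted(doc_id for doc_id, text in docs.items() if term in text.split())

def ir_rank(docs, query):
    """IR-style retrieval: score each document, then sort by estimated relevance."""
    terms = query.split()
    scores = {}
    for doc_id, text in docs.items():
        counts = Counter(text.split())
        score = sum(counts[t] for t in terms)  # naive term-frequency score
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by id so the output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

print(db_select(DOCS, "query"))        # an unordered set of qualifying items
print(ir_rank(DOCS, "video ranking"))  # an ordered list, best guess first
```

The selection has exactly one right answer for a given dataset. The ranking depends entirely on the scoring function, and a user may well disagree with it.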

From these facts stem the different focuses in evaluating the performance of a query. Since in the DB field the correctness of a query result is assumed, the focus lies on quantitative measures, mainly the response time (speed) of the DBMS. In the IR approach speed also matters. What is much more important, though, is a reasonably good fulfillment of the user's need for information. As this aspect differs from user to user for the same result, a qualitative approach is much more important. In the first step it is of utmost importance that the need for information is optimally fulfilled. Only in the second step does the response time of the system play a role.
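A small sketch of the two kinds of measurement, under my own assumptions: I use precision at k as the quality measure (the text above does not name a specific one), and the ranking and user judgments are invented.

```python
import time

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k results that this particular user judged relevant."""
    top = ranking[:k]
    if not top:
        return 0.0
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def timed(fn, *args):
    """Return (result, elapsed seconds): the quantitative, DB-style measure."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# One ranking, two hypothetical users with different relevance judgments.
ranking = ["d3", "d2", "d1", "d4"]
user_a = {"d3", "d2"}
user_b = {"d1"}

p_a = precision_at_k(ranking, user_a, 2)  # both of the top 2 are relevant to user A
p_b = precision_at_k(ranking, user_b, 2)  # neither of the top 2 is relevant to user B
```

The response time is the same no matter who asks, while the quality score changes with the user even though the result list is identical.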

Semistructured data: The increasing importance of semistructured data brought the DB discipline and the IR discipline closer together. This began with the SGML standard but really took off with the wide usage of XML on the internet. The user can limit the parts of the dataset in which he is searching for information by structural constraints. Structure can also be understood as adding context to the original information (which had no explicit context before). In the case of XML the context is a keyword (the tag name) identifying the semantics of the data.
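As a rough illustration of such a structural constraint, the following sketch uses Python's standard `xml.etree.ElementTree` module. The sample document, its element names, and the helper `search` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document; the element names are made up for illustration.
XML = """
<library>
  <article>
    <title>Indexing video</title>
    <abstract>We study shot detection.</abstract>
  </article>
  <article>
    <title>Shot boundaries</title>
    <abstract>Indexing events over time.</abstract>
  </article>
</library>
"""

def search(root, keyword, element):
    """Return the title of every article whose <element> text contains keyword."""
    hits = []
    for article in root.iter("article"):
        node = article.find(element)
        if node is not None and node.text and keyword.lower() in node.text.lower():
            hits.append(article.findtext("title"))
    return hits

root = ET.fromstring(XML)
print(search(root, "indexing", "title"))     # keyword restricted to titles
print(search(root, "indexing", "abstract"))  # same keyword, different context
```

The same keyword finds different articles depending on which element it is restricted to, which is the kind of context the tag name supplies.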

Multimedia Context: In a multimedia context this could also be something other than semantics. Looking at video data, the context of information might be the boundary between one shot and the next. To generalize this, the context could be various sorts of events. They could be a single point in time (shot switch) or a time span (silence/noise detection).

Research Question: Given all these events/structures forming the context of a (multimedia) information source, we want to find the (video) documents in a collection that are relevant to our query.
  1. The first question arises when we look at storing the structure/events. The various events could be interleaved. There could be a shot event during a noise event, which could be a commentator talking throughout multiple shots. That means we need an intelligent way of storing these events with fast access and low maintenance time.
  2. The next question could be how we optimally exploit these structures to answer the queries (i.e. what kind of index structures should we use and what kind of query processing model do we employ).
  3. And as a third question we will need to consider whether the available query languages (from DB / IR) are adequate to express the queries that fulfill the user's information need.
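
To give the first question some shape, here is a minimal sketch of interleaved events stored as intervals, with a point event being an interval of zero length. The `Event` class, the event labels, and the timings are my own assumptions. The overlap query is a plain linear scan, so it only shows the data model and is not an answer to the "fast access" part.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    kind: str     # e.g. "shot_switch" or "noise"; the labels are hypothetical
    start: float  # seconds from the beginning of the video
    end: float    # equal to start for a point event

def overlapping(events, start, end):
    """Return all events that intersect the closed interval [start, end]."""
    return [e for e in events if e.start <= end and e.end >= start]

events = [
    Event("noise", 10.0, 45.0),        # commentator talking across several shots
    Event("shot_switch", 20.0, 20.0),  # point events inside the noise span
    Event("shot_switch", 33.0, 33.0),
    Event("silence", 45.5, 60.0),
]

def shots_during(events, kind):
    """Times of the shot switches that fall inside any event of the given kind."""
    spans = [e for e in events if e.kind == kind]
    return sorted({s.start for span in spans
                   for s in overlapping(events, span.start, span.end)
                   if s.kind == "shot_switch"})

print(shots_during(events, "noise"))
```

A structure such as an interval tree would presumably replace the linear scan in a real system, at the price of the maintenance cost the first question worries about.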