[ Home Page | Indexing Engine ]
Get this in PDF Get the Indexing Engine (.zip) Need Java 1.5.
Indexing Engine is an experiment in information retrieval, created in the fall of 2005 for Dr. David Johnson, CMSI 698. Indexing Engine is a complete system (search engine) for crawling, indexing and searching on the web. It provides tools to analyze: document content; indexing parsing and searching algorithms; and many levels of system performance.
Indexing Engine is a cross-platform Java* search engine implemented as a desktop application. It is an experiment in information retrieval. It provides all the features of a large-scale search engine without the scalability. This paper describes the anatomy of the engine, its features, future work and a comparison to other search engines currently in use.
*Indexing Engine requires Java JRE 1.5.
The first search engine for the World Wide Web was called ”Wandex” and was built in 1993 [sea]. In more recent times the explosion of the web has taken search engines to a new level of complexity. Today new content is placed on the web at a faster rate than any system can handle. The top three search engines in the world, Google, Yahoo Search and Microsoft, are the best systems made to date for indexing and searching large-scale document collections. These search engines are made from many fine tuned, distinct parts that must work together in order to search the constantly changing web. Minimizing operation speed and memory size within these distinct parts is a priority because of the immense scale of the task at hand—indexing and searching billions of pages. The algorithms and data structures must therefore be carefully chosen to best utilize the resources available.
At the highest level, the Indexing Engine is made up of the engine and the graphical user interface (GUI). The engine itself is divided into four main parts; the spider (Section 3.1.1), which manages crawling the web and retrieving pages; indexed documents (Section 3.1.2), which manage the content and parsing of web pages; the indexer (Section 3.1.3), which manages the indexed documents, the reverse index and searching on that index; the query parser (Section 3.1.4), which parses and builds a search out of a user query. The GUI (Section 3.1.7) allows a user to interface with the engine for monitoring and searching. It provides displays of all the statistics related to the engine as well as document content. Each of the four parts of the engine are described in more detail below.
Web crawling is a relatively slow process compared to other parts of the search engine system. For each web document a request must be sent out to a server and a reply containing the content of the document must be waited on. All communication must happen over the network which is significantly slower than other I/O transfers. To combat this issue, the spider in the Indexing Engine was built with a pool of concurrently executing threads, each of which handles its own web page and communication needs. As web page downloads complete and the documents are parsed (Section 3.1.2), the threads submit the parsed documents to the indexer for indexing and caching (Section 3.1.3). After completing a previous download, parse and submission, the thread pulls URLs awaiting download from a URL queue.
Parsing is the last phase of a web crawling thread before submission for indexing. In this phase the content of the document is examined—terms and URLs in the document content are realized. URLs that are found are verified and resubmitted to the spider. To verify a URL it must match a predefined regular expression and it must not have been previously downloaded or submitted. The terms in the document content are extracted based on two levels of predefined regular expressions. Each extracted term is recorded including all occurrences (frequency) and their locations (position). They are built into two tables, a term-to-frequency table and a term-to-position table.
The frequency is the number of times a term appears in the document and is a positive integer value. The position is a set of numbers, each of which is the starting position of the term based from the beginning of the document. These two tables alone are all that is required to identify and search for terms in a document. Other prominent information about the document is also recorded including: the originating URL; its parent URL; the time required to parse; the content encoding; content type and length; download, expiration and last modified dates and times of a page. These and all stats are provided by the document in its parsed form called an indexed document. From here the indexed document is sent to the indexer for indexing and caching.
The indexer manages indexed documents and builds a reverse index out of all the terms. As indexed documents are submitted to the indexer they are placed in a buffer queue from which the indexer pulls. This way the indexing of documents is never a bottleneck to the system. Documents pulled from the queue are processed—they are assigned a document number which is mapped to the originating URL. Their terms are extracted and added to the reverse index. The reverse index is a map from a term to the document numbers that contain that term. The indexer also tracks statistical data for all documents such as the average number of terms and URLs found per document, and the average time to index and parse a document. After indexing a document the indexer stores away the document to disk and pulls a new one from the queue.
Another major function of the indexer is searching and retrieving documents. Document retrieval in the indexer consists of locating and pulling documents from disk. Given a URL the document number can be mapped and the document retrieved. Searching with a given query consists of an algorithmic building of a document set that satisfies the query request. This process is described further in Section 3.1.4.
The search mechanism consists of a boolean search created from a query in conjunctive normal form. Conjunctive normal form requires that the terms be grouped as follows:
( v1 | v2 | ... | v3 ) & ( v4 | v5 | ... | v6 ) & (...)
where vn are terms grouped by | (or), called clauses, and the groups are separated by & (and).
Query parsing is implemented using a parser generator called JavaCC [jav]. The query language has reserved words and symbols as described in Section 3.1.5. The macrosyntax, written in Extended Backus Naur Form (EBNF) is described in Section 3.1.6.
The query language is small but provides all the mechanisms required for a boolean query. The macrosyntax consists of tokens which are reserved words, symbols and strings. A string is a case sensitive non-empty set of alphanumeric characters.
Using the query grammar, the query parser creates an expression representing the query. The expression is sent to the indexer and a document set that satisfies the expression is generated. Taking advantage of the the query in conjunctive normal form the indexer creates the document set as follows:
(1) For each term in a clause, if the term is not negated and it exists in the reverse index, add the documents to the clause-result set. If the term is negated, add all the documents in the reverse index to the clause-result set. Then if the negated term is in the reverse index, remove the documents associated with that term from the clause-result set.
(2) For each clause in the expression, intersect the clause-result sets to get final result set. Once the final result set containing document numbers has been generated, the document URLs for those document numbers are retrieved and returned as the results of the query.
Reserved words are case sensitive. Both reserved words and symbols are listed below.
and or not & | !
A syntactically valid query must be lexically valid and the tokenization must be derivable from the following grammar in EBNF.
QUERY --> EXP EXP --> EXP1 ( AND EXP1 )* EXP1 --> EXP2 | ‘(’ EXP2 ( OR EXP2 )+ ‘)’ EXP2 --> [ NOT ] STRING
The user interface consists of two control windows, two viewing windows and two help windows. The control windows are the main window and the spider window. The main window is comprised of five sections—the query/search area, the document collection list, the document viewer, the results list and the document information/statistics area.
The query/search area allows a query to be entered into a text field and a search will automatically be conducted on it displaying the results in the results list. It also includes a link to one of the two help files which explains how to create a query. The document collection list is a list of all the documents that are currently indexed by the indexer. It allows the user to select a document which is then viewed in the document viewer. Information and statistics for the document are viewed in the information/statistics area. The document viewer has a find feature built in that allows manual searches on a document as well. This can be used to verify the functionality of the parsing algorithm. There is a help file associated with the find feature for advanced searching options.
The spider window contains the starting URL address as well as the address parsing information. It has start and stop buttons to start and stop the crawling of the web pages. It displays the spiders current settings and statistics.
The two viewing windows are the indexer window and preferences window. The indexer window provides the current status of the indexer as well as indexing statistics. The preferences window is a portal to the default and preset system settings.
Listed here are the features that have been currently implemented in the Indexing Engine.
The original plan for the Indexing Engine was to be able to index the online encyclopedia, Wikipedia, which has approximately 800,000 pages. The scalability was by far the most challenging portion of creating the system. Three hundred documents is the maximum amount that the Indexing Engine has indexed in a session. It usually averages between 70 and 120 documents on the average run. At the top of the list of my future plans is increasing the scale to which the Indexing Engine can index web pages. The modular coding style will allow such scaleability to be built in.
Other future plans include building an HTML parser for parsing web pages and extracting terms. The current implementation with regular expressions is limited and does not capture all the terms in a document. The results can be compared by using the find feature after a search has been completed. Text highlighting in green are the indexed terms found and yellow is the find results. Another plan is to allow the user to change the starting URL and the regular expressions that define valid URLs. Also to allow the spider to restart after being stopped or to be saved and to have the indexer pull from disk a previously parsed document set.
All comparisons are based on Google during the development stage at Stanford [BP00]. The Indexing Engine is able to crawl up to eight pages per second. Googles system would crawl 25 pages per second (per crawler). To take into consideration the Google crawler ran on its own dedicated machine where as the Indexing Engine is running the entire system on one machine. With that aside, the Indexing Engine crawling capabilities are slower by a factor of three.
The Indexing Engine parser and indexer average about 20 pages per second while the dedicated machine running Googles indexer runs at roughly 54 pages per second—a factor of 2.7 times faster than the Indexing Engine.
Google estimates that queries take in between 1 and 10 seconds. Taking 5 as the average, on a lexicon of 14 million versus the Indexing Engine lexicon at 20 thousand, Google would take approximately 7 milliseconds to return a document on the smaller lexicon. The Indexing Engine takes between 1 and 31 milliseconds—an average of 15 milliseconds or a slower by a factor of two.
Indexing Engine is a complete search engine. First we looked at how the Indexing Engine was built including the major components that make up system. We introduced and described the query language for the system as well as a formal syntax definition. We introduced the user interface and described how it works related to the system. A bulleted list of features that are part of the Indexing Engine was provided. The challenges and future work were outlined.
Overall I am very satisfied with the out come of the system. It was thrilling to create a full fledged search engine from scratch—it is something I never thought I would get to do.
[BP00] Sergey Brin and Lawrence Page. The anatomy of a largescale hypertextual web search engine. 2000. http://wwwdb. stanford.edu/ backrub/google.html.
[jav] Javacc. https://javacc.dev.java.net/.
[sea] Search engine. http://en.wikipedia.org/wiki/.