Preprocessing

In this chapter we will study the processes through which a document must pass before it is introduced into a retrieval system. These transformations are referred to collectively as document preprocessing and include:

crawling documents,
reading and cleaning HTML annotations, and removing any graphics and mathematical expressions,
transforming documents to plain text,
tokenization,
stemming,
assigning annotations (metadata),
assigning weights to terms.
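
To make the order of these steps concrete, here is a minimal sketch in Python of a few of them chained together for a single document that has already been fetched. It is an illustration under stated assumptions and not a reference implementation. Crawling is left out, the function names (html_to_text, tokenize, toy_stem, preprocess) are hypothetical, the stemmer is a toy suffix stripper standing in for a real algorithm, and the weight is a plain relative term frequency chosen only for simplicity.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0  # depth of <script>/<style> elements we are inside

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text(html):
    """Strip the HTML annotations and return the remaining plain text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def tokenize(text):
    """Lowercase the text and split it into runs of ASCII letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())


def toy_stem(token):
    """Toy suffix stripping (an assumption for illustration, not a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(html, metadata=None):
    """Run the steps in order and return the metadata with the term weights."""
    text = html_to_text(html)
    stems = [toy_stem(t) for t in tokenize(text)]
    counts = Counter(stems)
    total = len(stems) or 1  # avoid division by zero on empty documents
    # Relative term frequency; a real system would likely use a richer scheme.
    weights = {term: n / total for term, n in counts.items()}
    return {"metadata": dict(metadata or {}), "weights": weights}
```

Calling preprocess on a small page whose visible text is "Crawling" followed by "Crawlers crawled pages." yields the terms crawl, crawler and page, with crawl carrying half of the total weight. Even this tiny case shows the toy stemmer failing to merge "crawlers" with "crawl", which hints at why each step is harder than it looks.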

Each of these processes, as we shall see, is a large and complex programming project in its own right. Our purpose here is only to raise the issues of complexity in these processes and to record, for each one, the difficulties one has to face in a real application. Before describing them, however, we shall give an overview of the architecture of a retrieval system, in order to understand better the contribution of each of the above transformations.