Text retrieval systems - 2B: Text Processors

Greetings,

Introduction

Welcome back. It's time to do some real design: I want to have two 'things':

1) a 'LibraryBuilder' that gradually builds the processed text and finally
builds the 'Library' itself.

2) a 'Processor' that processes the input text and spoonfeeds it to the first
object.

Processor

I want these two entities to be as general as possible. First I design the
wanted interfaces. Here is the Processor interface:

    import java.io.IOException;

    public interface Processor {

        public void process(String prefix) throws IOException;

        public Library getLibrary();
    }

The prefix String can be any string; the Processor knows what to do with it.
The prefix can be a URI, a directory or whatever else is needed to get to the
raw text.

The process() method does all the processing. Because things can go wrong
during processing, it is allowed to throw an IOException, which is the most
likely exception to occur here. I'll design subclasses thereof when
needed.
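
As an illustration of what such a subclass could look like, here is a minimal sketch. The name BookFormatException and its book field are assumptions made for this example only; they are not part of the code of this article series.

```java
import java.io.IOException;

// Hypothetical IOException subclass: signals that the raw text of a book
// could not be processed.
class BookFormatException extends IOException {

    private final int book; // index of the offending book

    BookFormatException(int book, String message) {
        super("book " + book + ": " + message);
        this.book = book;
    }

    int getBook() { return book; }
}
```

Because it is still an IOException, the signature of process() doesn't need to change; callers that care can catch the more specific type first.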

The second method gives me the end result: the Library. A Library is an ordinary
class that can retrieve text for me.

I want a Processor implementation to be as generic as possible, i.e. I don't
want to stick any particular King James bible knowledge into my Processor.

The Processor implementation will be an abstract class that does all the
organizational or 'conducting' work and leaves the particular King James bible
knowledge to a subclass. It declares abstract methods for that purpose.

LibraryBuilder

I use the same scenario for a LibraryBuilder:

    import java.io.IOException;

    public interface LibraryBuilder {

        public void preProcess();
        public void postProcess();

        public void setTitle(String title);

        public void buildGroup(String group);
        public void buildParagraph(String book, String chapter,
                                   int para, String text) throws IOException;

        public Library build();
    }

The interface can't enforce it, but the intention is to call the preProcess()
method before anything else is done. After all processing is over and done
with, the postProcess() method is supposed to be called.
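
Although the interface can't enforce the call order at compile time, an implementation can check it at run time. A minimal sketch, assuming a small helper class that a builder implementation consults in each of its methods; LifeCycleGuard and its method names are hypothetical and not part of the article's code:

```java
// Hypothetical helper: tracks the life cycle of a builder so that
// out-of-order calls fail fast instead of silently corrupting the result.
class LifeCycleGuard {

    private enum State { NEW, PROCESSING, DONE }

    private State state = State.NEW;

    // Call from preProcess(): only legal as the very first step.
    void begin() {
        if (state != State.NEW)
            throw new IllegalStateException("preProcess() must be called first, and only once");
        state = State.PROCESSING;
    }

    // Call from setTitle(), buildGroup() and buildParagraph().
    void requireProcessing(String method) {
        if (state != State.PROCESSING)
            throw new IllegalStateException(method + " called outside preProcess()/postProcess()");
    }

    // Call from postProcess(): closes the processing phase.
    void end() {
        requireProcessing("postProcess()");
        state = State.DONE;
    }

    // Call from build(): the Library is only complete after postProcess().
    void requireDone() {
        if (state != State.DONE)
            throw new IllegalStateException("build() called before postProcess()");
    }
}
```

Whether such a check is worth the extra code is a design choice; the article simply relies on the Processor calling the methods in the right order.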

At the very end the LibraryBuilder is supposed to give me a Library object
when its build() method is invoked.

The Library class itself doesn't know which text it handles, i.e. it knows
nothing about King James bible texts, nor about CD collections or whatever.

The two remaining methods implement the text spoonfeeding:

1) buildGroup() builds a new group for its caller.
2) buildParagraph() builds a new paragraph given a book, chapter, paragraph
number and the raw paragraph text. It may throw an IOException if needed.
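
The intended call sequence can be sketched with a builder that does nothing but record the calls it receives. Everything below apart from the LibraryBuilder method signatures is an assumption: the Library class is an empty placeholder, RecordingBuilder and SpoonfeedDemo are hypothetical names, and the sketch assumes that a group is a named collection of books and that a paragraph corresponds to a verse.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Placeholder: the real Library class is not shown in this article part.
class Library { }

// The interface from the article, without the redundant public modifiers.
interface LibraryBuilder {
    void preProcess();
    void postProcess();
    void setTitle(String title);
    void buildGroup(String group);
    void buildParagraph(String book, String chapter, int para, String text)
            throws IOException;
    Library build();
}

// Hypothetical builder that only records the calls it receives.
class RecordingBuilder implements LibraryBuilder {

    final List<String> calls = new ArrayList<String>();

    public void preProcess()  { calls.add("preProcess"); }
    public void postProcess() { calls.add("postProcess"); }
    public void setTitle(String title)   { calls.add("setTitle " + title); }
    public void buildGroup(String group) { calls.add("buildGroup " + group); }

    public void buildParagraph(String book, String chapter, int para, String text) {
        calls.add("buildParagraph " + book + " " + chapter + ":" + para);
    }

    public Library build() { return new Library(); }
}

class SpoonfeedDemo {

    // Feeds one group and two paragraphs to the builder, in the intended order.
    static List<String> feed() {
        RecordingBuilder builder = new RecordingBuilder();
        builder.preProcess();
        builder.setTitle("King James Bible");
        builder.buildGroup("Old Testament"); // assumed group name
        builder.buildParagraph("Genesis", "1", 1,
                "In the beginning God created the heaven and the earth.");
        builder.buildParagraph("Genesis", "1", 2,
                "And the earth was without form, and void;");
        builder.postProcess();
        builder.build();
        return builder.calls;
    }

    public static void main(String[] args) {
        for (String call : feed()) System.out.println(call);
    }
}
```

In the real design these calls are made by the Processor, not by hand; the sketch only shows the order in which a builder can expect them.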

As you may have guessed, when the Processor.getLibrary() method is invoked it
delegates the job to the LibraryBuilder.build() method.

Here too, I want a LibraryBuilder to be as generic as possible so I implement
an abstract class that implements the LibraryBuilder interface. This class does
all the work that doesn't need any particular knowledge about the King James
bible text.

A special subclass should implement the abstract methods declared in the abstract
superclass; that is where it can stick its specific King James bible knowledge.

Class structure

This is the top level class structure:

    // interfaces:
    interface Processor { ... }
    interface LibraryBuilder { ... }
    // implementing classes:
    abstract class AbstractProcessor implements Processor { ... }
    abstract class AbstractBuilder implements LibraryBuilder { ... }

For this particular example project I have to implement two specific classes:

    class KJProcessor extends AbstractProcessor { ... }
    class KJBuilder extends AbstractBuilder { ... }

These two classes contain specific King James bible knowledge about the text
being processed and from which a Library is constructed by the builder.

AbstractProcessor

The AbstractProcessor does all the 'conducting' work for the raw text processing
job. It needs to be subclassed for the real job. Here is its first part:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.Reader;

    public abstract class AbstractProcessor implements Processor {

        protected LibraryBuilder builder;
        protected String title;

        public AbstractProcessor(String title, LibraryBuilder builder) {

            this.builder= builder;
            this.title= title;
        }
        ...

An AbstractProcessor can be constructed given the title of the library and
a LibraryBuilder. The KJProcessor supplies the KJBuilder for its superclass
as well as the title String.
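
A minimal sketch of that wiring, using stand-ins for the real classes. DemoProcessor and DemoBuilder are hypothetical names that take the place of KJProcessor and KJBuilder, the title string is an assumption, and Library, LibraryBuilder and AbstractProcessor are reduced to the bare minimum needed to compile:

```java
// Placeholder: the real Library class is not shown in this article part.
class Library { }

// Reduced to a single method for brevity; the article's interface has more.
interface LibraryBuilder {
    Library build();
}

// Reduced version of the article's AbstractProcessor: constructor and
// delegating getLibrary() only.
abstract class AbstractProcessor {

    protected LibraryBuilder builder;
    protected String title;

    public AbstractProcessor(String title, LibraryBuilder builder) {
        this.builder = builder;
        this.title = title;
    }

    public Library getLibrary() { return builder.build(); }
}

// Hypothetical stand-in for KJBuilder.
class DemoBuilder implements LibraryBuilder {
    public Library build() { return new Library(); }
}

// Hypothetical stand-in for KJProcessor: it picks both the title and the
// builder, so callers don't need to know about either.
class DemoProcessor extends AbstractProcessor {

    DemoProcessor() {
        super("King James Bible", new DemoBuilder());
    }
}
```

With this arrangement a caller only writes new DemoProcessor() and never touches the builder directly.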

The abstract methods defined in this class are:

        ...
        protected abstract void preProcess();
        protected abstract void postProcess();

        protected abstract int getNofBooks();
        protected abstract String getBookTitle(String prefix, int book);

        protected abstract Reader getBookReader(String prefix, int book)
                    throws IOException;

        protected abstract void processBook(String title, BufferedReader br)
                    throws IOException;
        ...

Similar to the LibraryBuilder this object calls the preProcess() method before
processing starts. When the processing is done the postProcess() method is
invoked. The KJProcessor implements empty methods for these two abstract methods
because it doesn't need to do any special pre- or postprocessing.

The AbstractProcessor needs to know how many books are to be processed and it
needs the title of each book. That's what the next two methods are for and
they need to be implemented in a subclass of the AbstractProcessor class.

The getBookReader() method needs to return a Java Reader object that can read
from a book. The last method, processBook(), must process an entire book, given
the title of that book and a BufferedReader for it.

The last two methods can throw an IOException because any input/output
related action can go wrong.
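
A minimal sketch of how a subclass might turn the prefix and a book number into a Reader, assuming the prefix is a directory that holds one UTF-8 text file per book. The class name BookFiles and the file naming scheme book00.txt, book01.txt, ... are hypothetical; the real KJProcessor may do this quite differently.

```java
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class BookFiles {

    // Assumed layout: <prefix>/book00.txt, <prefix>/book01.txt, ...
    static Path bookPath(String prefix, int book) {
        return Paths.get(prefix, String.format("book%02d.txt", book));
    }

    // Opens the file for the given book; a missing or unreadable file
    // surfaces as an IOException, which the abstract method allows.
    static Reader getBookReader(String prefix, int book) throws IOException {
        return Files.newBufferedReader(bookPath(prefix, book), StandardCharsets.UTF_8);
    }
}
```

A subclass could delegate its own getBookReader() to a helper like this; a URI based prefix would need a different implementation but the same signature.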

Note that the subclass can invoke methods and read or alter member variables
in the builder directly, i.e. the coupling between the two is tight.

Here's the delegator method when a Library object is wanted:

        ...
        public Library getLibrary() { return builder.build(); }
        ...

As described above, the AbstractProcessor simply invokes the builder.build()
method to obtain the Library.

Now for some substantial conducting work. The next method in the AbstractProcessor
class is the implementation of the process() method defined in the Processor
interface:

        ...
        public void process(String prefix) throws IOException {

            builder.preProcess();
            builder.setTitle(title);

            this.preProcess();

            for (int i= 0, n= getNofBooks(); i < n; i++)
                processBook(prefix, i);

            this.postProcess();

            builder.postProcess();
        }
        ...

It calls the preProcess() methods on the builder and the subclass and it
passes the title to the builder.

Next it determines the number of books to be processed and processes each
book by invoking the processBook() method (see below).

When everything succeeds the postProcess() method is invoked on both the
subclass and the builder.
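
To see that order of calls in action, here is a self-contained sketch. SketchProcessor is a condensed stand-in for the AbstractProcessor in which the builder is replaced by a plain call log, and InMemoryProcessor is a hypothetical subclass that serves two tiny in-memory 'books' instead of real files; none of these names come from the article's code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Condensed stand-in for the AbstractProcessor: calls on the builder are
// only logged, so the sketch needs no Library or LibraryBuilder.
abstract class SketchProcessor {

    protected final List<String> log = new ArrayList<String>();

    protected abstract void preProcess();
    protected abstract void postProcess();
    protected abstract int getNofBooks();
    protected abstract String getBookTitle(String prefix, int book);
    protected abstract Reader getBookReader(String prefix, int book) throws IOException;
    protected abstract void processBook(String title, BufferedReader br) throws IOException;

    // Same order of steps as the process() method shown in the article.
    public void process(String prefix) throws IOException {
        log.add("builder.preProcess");
        log.add("builder.setTitle");
        preProcess();
        for (int i = 0, n = getNofBooks(); i < n; i++) {
            BufferedReader br = new BufferedReader(getBookReader(prefix, i));
            try { processBook(getBookTitle(prefix, i), br); }
            finally { br.close(); }
        }
        postProcess();
        log.add("builder.postProcess");
    }
}

// Hypothetical subclass with two in-memory books.
class InMemoryProcessor extends SketchProcessor {

    private final String[] titles = { "First", "Second" };
    private final String[] texts  = { "line 1\nline 2\n", "line 3\n" };

    protected void preProcess()  { log.add("this.preProcess"); }
    protected void postProcess() { log.add("this.postProcess"); }
    protected int getNofBooks()  { return titles.length; }
    protected String getBookTitle(String prefix, int book) { return titles[book]; }
    protected Reader getBookReader(String prefix, int book) { return new StringReader(texts[book]); }

    // Counts the lines of the book instead of really processing them.
    protected void processBook(String title, BufferedReader br) throws IOException {
        int lines = 0;
        while (br.readLine() != null) lines++;
        log.add("processBook " + title + " (" + lines + " lines)");
    }

    // Runs the whole pipeline and returns the recorded call order.
    static List<String> run() {
        InMemoryProcessor p = new InMemoryProcessor();
        try { p.process("unused-prefix"); }
        catch (IOException e) { throw new RuntimeException(e); }
        return p.log;
    }
}
```

The log returned by run() starts with the two builder calls, then the subclass hook, then one entry per book, and ends with the two postProcess steps in reverse order of the preProcess steps.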

Here's the processBook() method implementation:

        ...
        public void processBook(String prefix, int book) throws IOException {

            BufferedReader br= null;

            try {
                br= new BufferedReader(getBookReader(prefix, book));
                processBook(getBookTitle(prefix, book), br);
            }
            finally {
                // br is still null if getBookReader() threw; closing it then
                // would raise a NullPointerException that hides the IOException
                if (br != null)
                    try { br.close(); } catch (IOException ioe) { }
            }
        }

This method asks the subclass to return a Reader given a book. It wraps
a BufferedReader around the Reader and asks the subclass again to process
the current book. Finally the buffered reader is closed again, which closes
the wrapped reader itself.
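
On Java 7 and later the same clean-up can be written with a try-with-resources statement. The sketch below is an alternative to the article's try/finally, not the author's code; CloseDemo, TrackingReader and countLines() are hypothetical names, and line counting stands in for real book processing. It also checks that closing the BufferedReader closes the wrapped Reader.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class CloseDemo {

    // Hypothetical Reader that remembers whether close() was called on it.
    static class TrackingReader extends StringReader {
        boolean closed = false;
        TrackingReader(String text) { super(text); }
        @Override public void close() { closed = true; super.close(); }
    }

    // Counts the lines of a 'book'. The BufferedReader is closed automatically
    // when the try block is left, even if readLine() throws.
    static int countLines(Reader source) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            int lines = 0;
            while (br.readLine() != null) lines++;
            return lines;
        }
    }

    // Returns true if closing the BufferedReader also closed the wrapped reader.
    static boolean wrappedReaderIsClosed() {
        TrackingReader source = new TrackingReader("verse 1\nverse 2\n");
        try { countLines(source); }
        catch (IOException e) { throw new RuntimeException(e); }
        return source.closed;
    }
}
```

One difference from the try/finally version: an IOException thrown by close() is no longer swallowed silently; it is either propagated or attached as a suppressed exception to the one thrown in the try block.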

I think this is enough design and implementation for this week. Next week I'll
show how the LibraryBuilder is designed and implemented. It's more work than
this Processor implementation.

After that I'll show the KJProcessor and KJBuilder classes; they handle the
nitty-gritty String processing work and are basically the implementations
of the abstract methods defined in their parent classes and a few ugly methods
that must come up with consistent text (see last week's article part).

I'll add all the code as an attachment to one of the following article parts so
you can play with it or maybe actually apply it in a useful way. It doesn't
hurt to actually read the source code. If you find bugs feel free to correct me.

See you next week and

kind regards,

Jos
Jul 13 '07 #1