Text retrieval systems - 3B: the Library Builder

Greetings,

Welcome back; let's get on with the implementation of the AbstractBuilder class.

AbstractBuilder implementation

Here's the data part of an AbstractBuilder:

    protected String title;

    protected WordMapBuilder wordMap    = new WordMapBuilder();

    protected SectionsBuilder groups    = new SectionsBuilder();
    protected SectionsBuilder books     = new SectionsBuilder();
    protected SectionsBuilder chapters  = new SectionsBuilder();

    protected StringsBuilder paragraphs = new StringsBuilder();
    protected StringsBuilder words      = new StringsBuilder();

    protected String[] punctuation      = { "", "" };

The String title and the several builders don't need any further explanation.
The String[] punctuation simply stores two Strings. Each character in each
String is a punctuation character from the text being fed to the AbstractBuilder
object. The first String contains the punctuation characters followed by a
space; the second String contains the punctuation characters not followed by
a space. The AbstractBuilder adds new punctuation characters to each of the
Strings when needed.

Recall that words are stored only once; for every word in a paragraph just
its index in a list is stored. Index values start at 0x0100 (see previous
part of this article). So if a word is stored at location 'i' in a list,
its index equals 'i+0x0100'. These index values are perfectly legal characters,
so a paragraph can be stored in a String.
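
The index arithmetic can be tried out in isolation. The following is a minimal sketch, not the article's code: the class name WordIndexSketch and the helpers encode() and decode() are made up for illustration; only the 0x0100 offset comes from the text above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class; only the 0x0100 offset is taken from the article.
class WordIndexSketch {

    static final int OFFSET = 0x0100; // first index value used for words

    // Turn a position in the word list into a single char.
    static char encode(int position) {
        int index = position + OFFSET;
        if (index > 0xffff)
            throw new IllegalArgumentException("index does not fit in a char");
        return (char) index;
    }

    // Turn a char from a stored paragraph back into a list position.
    static int decode(char c) {
        return c - OFFSET;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("in", "the", "beginning");

        // Store the paragraph "in the beginning" as three chars.
        StringBuilder paragraph = new StringBuilder();
        for (int i = 0; i < words.size(); i++)
            paragraph.append(encode(i));

        // Reproduce the words from the stored chars.
        for (char c : paragraph.toString().toCharArray())
            System.out.println(words.get(decode(c)));
    }
}
```

Because every word costs exactly one char, a paragraph of three words takes three chars no matter how long the words are.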

The following method uses the notion described above:

    private char getWord(String word) throws IOException {

        WordBuilder w= wordMap.get(word);
        int  i;

        if (w == null) {
            // new word: its index is its position in the word list plus 0x100
            if ((i= words.size()+0x100) > 0xffff)
                throw new LibException("too many different words");

            wordMap.put(word, w= new WordBuilder(i));
            words.add(word);
        }
        else
            i= w.getIndex();

        // record that the word occurs in the current paragraph
        w.addParagraph(paragraphs.size());

        return (char)i;
    }

The method tries to find a word in the wordMap; if it wasn't found and the
word list already contains the maximum number of unique words (0xff00, i.e.
65,280 words, because the indexes run from 0x0100 up to 0xffff), the method
fails. Otherwise a new WordBuilder is stored in the map with its new index.

If the word was found, its index is retrieved from the found WordBuilder.
Finally the paragraph number (which is the current paragraph being fed to
the AbstractBuilder) is added to the WordBuilder. This private method is used
by the text compression method.
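
To experiment with this logic without the rest of the builder classes, here is a self-contained sketch. It is an approximation under assumptions: the nested Entry class stands in for WordBuilder, a HashMap and an ArrayList stand in for WordMapBuilder and StringsBuilder, and the field currentParagraph stands in for paragraphs.size(). None of these stand-ins are the article's real classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone approximation of getWord(); all types here are stand-ins.
class GetWordSketch {

    // Hypothetical replacement for WordBuilder: an index plus paragraph numbers.
    static class Entry {
        final int index;
        final List<Integer> paragraphs = new ArrayList<>();
        Entry(int index) { this.index = index; }
    }

    final Map<String, Entry> wordMap = new HashMap<>();
    final List<String> words = new ArrayList<>();
    int currentParagraph = 0; // stands in for paragraphs.size()

    char getWord(String word) {
        Entry e = wordMap.get(word);
        if (e == null) {
            // New word: position in the word list plus 0x100 becomes its index.
            int i = words.size() + 0x100;
            if (i > 0xffff)
                throw new IllegalStateException("too many different words");
            e = new Entry(i);
            wordMap.put(word, e);
            words.add(word);
        }
        // Every occurrence is recorded; the real WordBuilder may well
        // store each paragraph number only once.
        e.paragraphs.add(currentParagraph);
        return (char) e.index;
    }
}
```

Calling getWord("light") twice returns the same char both times, and the word list grows only on the first call.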

Here are the first few methods from the implemented LibBuilder interface;
they are trivial:

    public void preProcess() { }

    public void setTitle(String title) { this.title= title; }

    public void buildGroup(String group) {

        groups.addSection(group, books.size(), false);
    }

No preprocessing is needed for the AbstractBuilder; if an extending class does
need preprocessing this method can be overridden of course.

The next method basically does all the work: it receives a paragraph of text
and has to deal with it. Here it is:

    public void buildParagraph(String book, String chapter, int paragraph, String text) throws IOException {

        boolean add= books.addSection(book, chapters.size(), false);
        chapters.addSection(chapter, paragraphs.size(), add);

        text= normalizeSpace(text);
        text= normalizeControl(text);
        text= clean(book, chapter, paragraph, text);
        text= normalizePunctuation(text);
        text= compress(text);

        paragraphs.add(text);
    }

First the book is added to the book SectionsBuilder; if the book name is the
same as the previous one, no new book is added and the method returns false;
otherwise the method returns true, indicating that a new book was added. If
so, a new chapter needs to be added no matter what, even if its name equals
the name of the previous chapter. You can see all that happen in the first
two lines of the method.
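
The interplay of the two addSection() calls is easier to follow with something runnable. The real SectionsBuilder was shown in an earlier part of this article; the class below is only a guess at its behaviour, reduced to what this part relies on. The names SectionsSketch, Section and close() are made up, and the assumption is that a section records where it starts and how many items it holds.

```java
import java.util.ArrayList;
import java.util.List;

// Guess at the SectionsBuilder behaviour this part relies on; not the real class.
class SectionsSketch {

    static class Section {
        final String name;
        final int start; // index of the first item in this section
        int total;       // number of items, known once the section is closed
        Section(String name, int start) { this.name = name; this.start = start; }
    }

    final List<Section> sections = new ArrayList<>();

    // Returns true if a new section was started; 'force' starts one even
    // when the name equals the name of the current section.
    boolean addSection(String name, int index, boolean force) {
        Section last = sections.isEmpty() ? null : sections.get(sections.size() - 1);
        if (!force && last != null && last.name.equals(name))
            return false;
        close(index);
        sections.add(new Section(name, index));
        return true;
    }

    // Tells the last section that nothing more will be added to it.
    void postProcess(int total) { close(total); }

    int size() { return sections.size(); }

    private void close(int end) {
        if (!sections.isEmpty()) {
            Section last = sections.get(sections.size() - 1);
            last.total = end - last.start;
        }
    }
}
```

The 'force' flag matters when one book ends in a chapter named "2" and the next book starts with a chapter that is also named "2": without it, the second chapter would be merged into the first.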

The other lines clean up the paragraph text bit by bit: first adjacent spaces
are 'normalized' to a single space after all other possible space characters
have been transformed to the ASCII space character (codepoint == 0x0020).

Next all control characters are removed from the paragraph text (such as ^Z).
The clean() method needs to be implemented by an extending class. We have seen
an example of this earlier in this article part. The normalizePunctuation()
method takes care of punctuation characters described above.
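
The two normalizing methods are not listed in this part. As a rough idea of what they could look like, here is a sketch built on the standard java.lang.Character tests; the class name NormalizeSketch is made up and the real methods may differ in detail (for example in how they treat leading or trailing spaces).

```java
// Possible shape of the two normalizing steps; not the article's own code.
class NormalizeSketch {

    // Turn every kind of space into U+0020 and collapse runs into one space.
    static String normalizeSpace(String text) {
        StringBuilder sb = new StringBuilder();
        boolean previousWasSpace = false;
        for (int i = 0, n = text.length(); i < n; i++) {
            char c = text.charAt(i);
            // isWhitespace covers tabs and line breaks, isSpaceChar covers
            // Unicode space separators such as the no-break space.
            boolean space = Character.isWhitespace(c) || Character.isSpaceChar(c);
            if (!space)
                sb.append(c);
            else if (!previousWasSpace)
                sb.append(' ');
            previousWasSpace = space;
        }
        return sb.toString();
    }

    // Drop all remaining control characters (0x00-0x1f and 0x7f-0x9f).
    static String normalizeControl(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0, n = text.length(); i < n; i++) {
            char c = text.charAt(i);
            if (!Character.isISOControl(c))
                sb.append(c);
        }
        return sb.toString();
    }
}
```

One caveat of this sketch: a control character that sits between two spaces leaves two adjacent spaces behind when it is removed, so a stricter version would collapse spaces once more afterwards.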

The compress method compresses the paragraph text. Here it is:

    protected boolean isLetter(int c) {

        return Character.isLetter(c) || Character.isDigit(c);
    }

    protected String compress(String text) throws IOException {

        StringBuilder sb= new StringBuilder(); // the compressed paragraph
        StringBuilder wb= new StringBuilder(); // the word being collected

        for (int i= 0, n= text.length(); i < n; i++) {

            char c= text.charAt(i);
            if (isLetter(c))
                wb.append(c);
            else {
                // end of a word: replace it by its index character
                if (wb.length() > 0) {
                    sb.append(this.getWord(wb.toString()));
                    wb.setLength(0);
                }

                if (c == ' ') continue;

                sb.append(c);

                // a code below 0x20 is a punctuation character followed
                // by a space; skip that space
                if (c < ' ') i++;
            }
        }

        // the paragraph may end with a word
        if (wb.length() > 0)
            sb.append(this.getWord(wb.toString()));

        return sb.toString();
    }

The parameter of this method is the paragraph text, nicely cleaned up. The
method uses another simple method: isLetter(). This method determines which
characters should be considered a letter. Any letter in the Unicode set as
well as any digit in the Unicode set is considered a letter in the paragraph
text.

The compress method has to deal with letters, spaces and punctuation characters.
The latter were already transformed to characters in the range 0x0000 to 0x001f
for punctuation characters followed by a space, and in the range 0x0080 to
0x009f for punctuation characters not followed by a space.
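
The normalizePunctuation() method itself is not listed in this part, so the following is only one way the two ranges could be filled. It assumes that a punctuation character is replaced by the base of its range plus its position in the matching punctuation String, and that the space after it is left in place for compress() to skip. The class name PunctuationSketch and the 32 character limit per String follow from those assumptions, not from the article.

```java
// One possible encoding of punctuation characters; an assumption, not the
// article's actual normalizePunctuation() method.
class PunctuationSketch {

    // [0]: characters seen followed by a space, [1]: not followed by a space
    final String[] punctuation = { "", "" };

    String normalizePunctuation(String text) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0, n = text.length(); i < n; i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c) || c == ' ') {
                sb.append(c);
                continue;
            }
            boolean spaced = i + 1 < n && text.charAt(i + 1) == ' ';
            int which = spaced ? 0 : 1;
            int pos = punctuation[which].indexOf(c);
            if (pos < 0) {
                // New punctuation character: add it to the proper String.
                if (punctuation[which].length() == 0x20)
                    throw new IllegalStateException("too many punctuation characters");
                pos = punctuation[which].length();
                punctuation[which] += c;
            }
            // 0x0000.. for 'followed by a space', 0x0080.. otherwise
            sb.append((char) ((spaced ? 0x0000 : 0x0080) + pos));
        }
        return sb.toString();
    }
}
```

Feeding it "Lo, it (was) so." leaves ",)" in the first String and "(." in the second one; the text keeps its length because every punctuation character is replaced by exactly one code.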

If a word is found its index is looked up in the wordMap and the index is
appended to the compressed String. Punctuation characters are appended as is
to the compressed String. Spaces are simply ignored. If a paragraph ends with
a word, the check after the loop finds it and appends the index of that word
to the compressed String as well.

Finally, the compressed paragraph String is returned and it will be appended
to the paragraph String list.
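
To see what comes out of the compression, the method can be run on a small hand-made input. The sketch below repeats the compress() logic with simplified stand-ins: a HashMap replaces the wordMap, and dump() is a made-up helper that prints each char as four hex digits. The input already contains punctuation codes, chosen by hand for this example: 0x0000 for a comma followed by a space and 0x0081 for a period not followed by a space.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, stand-alone variant of compress() for experimenting.
class CompressSketch {

    final List<String> words = new ArrayList<>();
    final Map<String, Integer> indexes = new HashMap<>();

    char getWord(String word) {
        Integer i = indexes.get(word);
        if (i == null) {
            i = words.size() + 0x100;
            indexes.put(word, i);
            words.add(word);
        }
        return (char) i.intValue();
    }

    String compress(String text) {
        StringBuilder sb = new StringBuilder();
        StringBuilder wb = new StringBuilder();
        for (int i = 0, n = text.length(); i < n; i++) {
            char c = text.charAt(i);
            if (Character.isLetterOrDigit(c)) {
                wb.append(c);
                continue;
            }
            if (wb.length() > 0) {           // a word just ended
                sb.append(getWord(wb.toString()));
                wb.setLength(0);
            }
            if (c == ' ') continue;          // spaces are not stored
            sb.append(c);                    // punctuation code, stored as is
            if (c < ' ') i++;                // skip the space that follows it
        }
        if (wb.length() > 0)
            sb.append(getWord(wb.toString()));
        return sb.toString();
    }

    // Made-up helper: show every char of a String as four hex digits.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray())
            sb.append(String.format("%04x ", (int) c));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        // "the light, the dark." with both punctuation characters already encoded
        String text = "the light" + (char) 0x0000 + " the dark" + (char) 0x0081;
        System.out.println(dump(new CompressSketch().compress(text)));
        // prints: 0100 0101 0000 0100 0102 0081
    }
}
```

The 20 characters of input shrink to 6 chars, and the second "the" reuses index 0x0100.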

When all the paragraph text has been fed to the AbstractBuilder, some
post-processing needs to be done: the last Sections for the groups, books and
chapters don't know yet that no more items will be added to them, i.e. their
'total' member variable needs to be set. This is how it's done:

    public void postProcess() {

        groups.postProcess(books.size());
        books.postProcess(chapters.size());
        chapters.postProcess(paragraphs.size());
        removeNoise();
    }

The postProcess() method of the SectionsBuilders (see above) is invoked with
the correct values. Finally the noise words are removed from the wordMap.
Here is the implementation:

    private void removeNoise() {

        Collection<String> noise= getNoise();

        if (noise != null)
            for (String word : noise)
                wordMap.remove(word);
    }

This method asks its subclass for a Collection of Strings. If the Collection
exists, all the words in the Collection are removed from the wordMap. Note
that entire key/value pairs are removed, so the WordBuilders that contain
the index lists for the words are also removed.
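
The effect of the removal can be checked with plain collections. In this sketch a HashMap stands in for the wordMap and an ordinary List for the word list; the class name NoiseSketch and the sample noise words are made up. As far as the code above shows, only the map is touched, so the word list keeps the noise words and a stored paragraph can still be turned back into text.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Stand-alone illustration of noise removal; the collections are stand-ins.
class NoiseSketch {

    // Same loop as removeNoise(), with the map and the noise passed in.
    static void removeNoise(Map<String, ?> wordMap, Collection<String> noise) {
        if (noise != null)
            for (String word : noise)
                wordMap.remove(word);
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("the", "light");
        Map<String, Integer> wordMap = new HashMap<>();
        wordMap.put("the", 0x100);
        wordMap.put("light", 0x101);

        // "of" is not in the map; removing it is harmless.
        removeNoise(wordMap, Arrays.asList("the", "of"));

        System.out.println(wordMap.keySet()); // only "light" can be looked up
        System.out.println(words);            // both words can still be reproduced
    }
}
```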

After post processing, a builder can be asked to build the end result; we want
a Library; here's the build() implementation from the interface:

    public Library build() {

        return new Library(title,
                   wordMap.build(),
                   groups.build(),
                   books.build(),
                   chapters.build(),
                   paragraphs.build(),
                   words.build(),
                   punctuation);
    }

All the embedded builders are asked to build() their result; everything is
passed to the Library constructor that will populate its member variables.
The title String and both the punctuation Strings are passed too.

Concluding remarks

In the previous article part we have built the Processor that reads raw text
and feeds paragraphs to a LibraryBuilder. This part of the article described
an implementation of such a LibraryBuilder. Such a builder can produce a
Library. The implementation of such a Library will be the subject of the
next part of this article.

A library is a Serializable object that contains a complete text. A library
itself can just reproduce (parts of the) text. It can reproduce an entire
book text, or just a single chapter from a book. It also is capable of
producing 'BookMarks' which are the bridge to a more 'intelligent' text
retrieval method: the Query. Queries are the subject of the last parts of
this article.

I hope you got this far; it was quite a bit of reading and understanding what
the structure of this all is, and I also hope to see you again next week.

kind regards,

Jos
Jul 22 '07 #1
Hi JosAH,
Did you post the complete code somewhere and I could not find it, or have you not attached the complete code yet?
Aug 4 '11 #2
