Welcome back; let's get on with the implementation of the AbstractBuilder class.
AbstractBuilder implementation
Here's the data part of an AbstractBuilder:
    protected String title;
    protected WordMapBuilder wordMap  = new WordMapBuilder();
    protected SectionsBuilder groups  = new SectionsBuilder();
    protected SectionsBuilder books   = new SectionsBuilder();
    protected SectionsBuilder chapters = new SectionsBuilder();
    protected StringsBuilder paragraphs = new StringsBuilder();
    protected StringsBuilder words    = new StringsBuilder();
    protected String[] punctuation    = { "", "" };
The String[] punctuation simply stores two Strings. Each character in each
String is a punctuation character from the text being fed to the AbstractBuilder
object. The first String contains the punctuation characters followed by a
space; the second String contains the punctuation characters not followed by
a space. The AbstractBuilder adds new punctuation characters to each of the
Strings when needed.
Recall that words are stored only once; for every word in a paragraph just
its index in a list is stored. Index values start at 0x0100 (see previous
part of this article). So if a word is stored at location 'i' in a list,
its index equals 'i+0x0100'. These index values are perfectly legal characters,
so a paragraph can be stored in a String.
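The index arithmetic described above can be sketched in isolation. This is a minimal sketch, not the article's code: the class name WordIndexCodec and its two methods are made up for illustration; only the offset 0x0100 and the upper limit 0xffff come from the text.

```java
// Hypothetical helper (not part of the article's classes): shows how a list
// position maps to a char and back, using the 0x0100 offset from the text.
final class WordIndexCodec {
    static final int OFFSET = 0x0100;

    // Turn a position in the word list into the char stored in a paragraph.
    static char encode(int listPosition) {
        int index = listPosition + OFFSET;
        if (listPosition < 0 || index > 0xffff)
            throw new IllegalArgumentException("position out of range: " + listPosition);
        return (char) index;
    }

    // Turn a stored char back into a position in the word list.
    static int decode(char stored) {
        if (stored < OFFSET)
            throw new IllegalArgumentException("not a word index: " + (int) stored);
        return stored - OFFSET;
    }
}
```

Because every encoded index is a single char, a paragraph of n words takes n chars (plus its punctuation characters) when stored in a String.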
The following method uses the notion described above:
    private char getWord(String word) throws IOException {
        WordBuilder w = wordMap.get(word);
        int i;
        if (w == null) {
            // new word: its index is its position in the words list + 0x100
            if ((i = words.size() + 0x100) > 0xffff)
                throw new LibException("too many different words");
            wordMap.put(word, w = new WordBuilder(i));
            words.add(word);
        }
        else
            i = w.getIndex();
        // register the current paragraph with this word
        w.addParagraph(paragraphs.size());
        return (char) i;
    }
The method first looks up the word in the wordMap. If the word isn't found and
the wordMap already contains the maximum number of unique words, the method
fails. Otherwise a new WordBuilder is stored in the map with its new index.
If the word was found, its index is retrieved from the found WordBuilder.
Finally the paragraph number (which is the current paragraph being fed to
the AbstractBuilder) is added to the WordBuilder. This private method is used
by the text compression method.
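To try the lookup-or-add logic without the rest of the builder classes, here is a self-contained sketch that uses plain java.util collections. The class WordTableSketch and its method names are hypothetical stand-ins for the WordMapBuilder, WordBuilder and StringsBuilder classes; only the logic follows the method above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for wordMap + words: every word is stored once and
// remembers the paragraphs it occurs in.
final class WordTableSketch {
    private final Map<String, Integer> indexByWord = new HashMap<>();
    private final Map<String, List<Integer>> paragraphsByWord = new HashMap<>();
    private final List<String> words = new ArrayList<>();

    // Return the char index for a word, adding the word if it is new.
    char intern(String word, int paragraph) {
        Integer index = indexByWord.get(word);
        if (index == null) {
            index = words.size() + 0x0100;
            if (index > 0xffff)
                throw new IllegalStateException("too many different words");
            indexByWord.put(word, index);
            paragraphsByWord.put(word, new ArrayList<>());
            words.add(word);
        }
        // register the paragraph for new and existing words alike
        paragraphsByWord.get(word).add(paragraph);
        return (char) index.intValue();
    }

    // Paragraph numbers registered for a word; empty if the word is unknown.
    List<Integer> paragraphsOf(String word) {
        List<Integer> list = paragraphsByWord.get(word);
        return list == null ? new ArrayList<>() : list;
    }
}
```

Interning the same word twice yields the same char both times; only the paragraph list of that word grows.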
Here are the first few methods from the implemented LibBuilder interface;
they are trivial:
    public void preProcess() { }

    public void setTitle(String title) { this.title = title; }

    public void buildGroup(String group) {
        groups.addSection(group, books.size(), false);
    }
By default the preProcess() method does nothing; if a builder does
need preprocessing this method can be overridden of course.
The next method basically does all the work: it receives a paragraph of text
and has to deal with it. Here it is:
    public void buildParagraph(String book, String chapter, int paragraph, String text) throws IOException {
        // a new book always starts a new chapter
        boolean add = books.addSection(book, chapters.size(), false);
        chapters.addSection(chapter, paragraphs.size(), add);
        // clean up the text step by step, then compress and store it
        text = normalizeSpace(text);
        text = normalizeControl(text);
        text = clean(book, chapter, paragraph, text);
        text = normalizePunctuation(text);
        text = compress(text);
        paragraphs.add(text);
    }
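The addSection() method compares the name of the new section with the name of
the section that was added last. If the names are the
same, no new book is added and the method returns false; otherwise the method
returns true, indicating that a new book was added. If so, a new chapter needs
to be added no matter what. You can see all that happen in the first two lines
of the method.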
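The real SectionsBuilder was presented earlier in this article; the sketch below only mimics the behaviour that matters here, under the assumption that addSection(name, start, force) adds a section when the name differs from the last one or when force is true. SectionsSketch and its members are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for SectionsBuilder: a list of section names, each
// with the index of its first item (a chapter, a paragraph, ...).
final class SectionsSketch {
    private final List<String> names = new ArrayList<>();
    private final List<Integer> starts = new ArrayList<>();

    // Add a section unless it has the same name as the last one; 'force'
    // adds it anyway. Returns true if a section was added.
    boolean addSection(String name, int start, boolean force) {
        boolean sameAsLast = !names.isEmpty()
                && names.get(names.size() - 1).equals(name);
        if (sameAsLast && !force)
            return false;
        names.add(name);
        starts.add(start);
        return true;
    }

    int size() { return names.size(); }

    int startOf(int section) { return starts.get(section); }
}
```

With two of these objects, one for books and one for chapters, a chapter named "1" in a new book is still added even if the previous book ended with a chapter named "1", because the result of the books call is passed as the force flag.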
The other lines clean up the paragraph text bit by bit: first adjacent spaces
are 'normalized' to a single space after all other possible space characters
have been transformed to the ASCII space character (codepoint == 0x0020).
Next all control characters are removed from the paragraph text (such as ^Z).
The clean() method needs to be implemented by an extending class. We have seen
an example of this earlier in this article part. The normalizePunctuation()
method takes care of punctuation characters described above.
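The first two clean-up steps could look like the sketch below. This is an assumption about how such methods might be written, not the article's own implementation; in particular, which characters count as 'space characters' is decided here by Character.isWhitespace() and Character.isSpaceChar(), and a control character is whatever Character.isISOControl() reports.

```java
// Hypothetical versions of the first two clean-up steps.
final class NormalizeSketch {

    // Map every kind of space (tab, no-break space, ...) to ' ' and
    // collapse runs of spaces into a single one.
    static String normalizeSpace(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        boolean lastWasSpace = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isWhitespace(c) || Character.isSpaceChar(c)) {
                if (!lastWasSpace)
                    sb.append(' ');
                lastWasSpace = true;
            }
            else {
                sb.append(c);
                lastWasSpace = false;
            }
        }
        return sb.toString();
    }

    // Drop all control characters (0x00-0x1f and 0x7f-0x9f), such as ^Z.
    static String normalizeControl(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!Character.isISOControl(c))
                sb.append(c);
        }
        return sb.toString();
    }
}
```

Note the order: tabs and newlines are control characters too, so they have to be turned into spaces before the control characters are removed, otherwise two words separated by just a tab would be glued together.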
The compress method compresses the paragraph text. Here it is:
    protected boolean isLetter(int c) {
        return Character.isLetter(c) || Character.isDigit(c);
    }

    protected String compress(String text) throws IOException {
        StringBuilder sb = new StringBuilder();   // the compressed paragraph
        StringBuilder wb = new StringBuilder();   // the word being collected
        for (int i = 0, n = text.length(); i < n; i++) {
            char c = text.charAt(i);
            if (isLetter(c))
                wb.append(c);
            else {
                // end of a word: replace it by its index
                if (wb.length() > 0) {
                    sb.append(this.getWord(wb.toString()));
                    wb.setLength(0);
                }
                if (c == ' ') continue;
                sb.append(c);
                // codes below 0x20 imply a following space: skip it
                if (c < ' ') i++;
            }
        }
        // the paragraph may end with a word
        if (wb.length() > 0)
            sb.append(this.getWord(wb.toString()));
        return sb.toString();
    }
The compress() method uses another simple method: isLetter(). This method
determines which characters should be considered a letter: any letter in the
Unicode set as well as any digit in the Unicode set is considered a letter in
the paragraph text.
The compress method has to deal with letters, spaces and punctuation characters
that were already transformed to characters in the range 0x0000 to 0x001f for
punctuation characters followed by a space and the range 0x0080 to 0x009f for
punctuation characters not followed by a space.
If a word is found its index is looked up in the wordMap and the index is
appended to the compressed String. Punctuation characters are appended as is
to the compressed String. Spaces are simply ignored. If a paragraph ends with
a word, the check after the loop will find it and append the index of that
word to the compressed String as well.
Finally, the compressed paragraph String is returned and it will be appended
to the paragraph String list.
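To see that no information needed for reading is lost, here is a sketch of the reverse operation. It rests on two assumptions that the text above doesn't spell out: a punctuation code is the position of the character in its punctuation String (plus 0x0080 for the second String), and a single space belongs between two adjacent words. The class name DecompressSketch is hypothetical; the Library may well decode its paragraphs differently.

```java
import java.util.List;

// Hypothetical decoder for a compressed paragraph.
final class DecompressSketch {

    static String decompress(String compressed, List<String> words,
                             String[] punctuation) {
        StringBuilder sb = new StringBuilder();
        boolean previousWasWord = false;
        for (int i = 0; i < compressed.length(); i++) {
            char c = compressed.charAt(i);
            if (c >= 0x0100) {
                // a word index; assumed: one space between adjacent words
                if (previousWasWord)
                    sb.append(' ');
                sb.append(words.get(c - 0x0100));
                previousWasWord = true;
                continue;
            }
            if (c <= 0x001f)
                // assumed: code == position in the 'followed by a space' String
                sb.append(punctuation[0].charAt(c)).append(' ');
            else if (c >= 0x0080 && c <= 0x009f)
                // assumed: code - 0x0080 == position in the other String
                sb.append(punctuation[1].charAt(c - 0x0080));
            else
                sb.append(c); // anything else is copied as is
            previousWasWord = false;
        }
        return sb.toString();
    }
}
```

For example, with the word list { "Hello", "world", "again" } and the punctuation Strings { ",", "." }, the five chars 0x0100 0x0000 0x0101 0x0102 0x0080 decode to "Hello, world again."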
When all the paragraph text has been fed to the AbstractBuilder, some post-
processing needs to be done: the last Sections for the groups, books and
chapters don't know yet that no more items will be added to them, i.e. their
'total' member variable needs to be set. This is how it's done:
    public void postProcess() {
        groups.postProcess(books.size());
        books.postProcess(chapters.size());
        chapters.postProcess(paragraphs.size());
        removeNoise();
    }
The three postProcess() calls set the 'total' members of the last Sections to
the correct values. Finally the noise words are removed from the wordMap.
Here is the implementation:
    private void removeNoise() {
        Collection<String> noise = getNoise();
        if (noise != null)
            for (String word : noise)
                wordMap.remove(word);
    }
The getNoise() method returns a Collection of noise words. If such a Collection
exists, all the words in the Collection are removed from the wordMap. Note
that both the key and the value of each pair are removed, so the WordBuilders
that contain the index lists for the words are also removed.
After post processing, a builder can be asked to build the end result; we want
a Library; here's the build() implementation from the interface:
    public Library build() {
        return new Library(title,
                           wordMap.build(),
                           groups.build(),
                           books.build(),
                           chapters.build(),
                           paragraphs.build(),
                           words.build(),
                           punctuation);
    }
All the builders build their part of the end result; those parts are
passed to the Library constructor that will populate its member variables.
The title String and both the punctuation Strings are passed too.
Concluding remarks
In the previous article part we have built the Processor that reads raw text
and feeds paragraphs to a LibraryBuilder. This part of the article described
an implementation of such a LibraryBuilder. Such a builder can produce a
Library. The implementation of such a Library will be the subject of the
next part of this article.
A library is a Serializable object that contains a complete text. A library
itself can just reproduce (parts of the) text. It can reproduce an entire
book text, or just a single chapter from a book. It also is capable of
producing 'BookMarks' which are the bridge to a more 'intelligent' text
retrieval method: the Query. Queries are the subject of the last parts of
this article.
I hope you got this far; it was quite a bit of reading and understanding what
the structure of this all is. I also hope to see you again next week.
kind regards,
Jos