Kludging Text Parsing In Java

by Steven J. Owens (unless otherwise attributed)

Prasenjit writes:
>Well Steven I don't think StringTokenizer is going to help me in my
>development. For best understanding please have a look at the file which I
>want to parse, part of which is presented below. This is exactly the report
>we need to parse :

From the sound of it, you'll be using StringTokenizer, but that's just a small part of what you need to do.

I agree with the poster who said you need to check out regular expressions. I'm quite fond of them in emacs macros and perl scripts. I haven't yet used them in Java, but it sounds like they're your best bet. For example, using a regular expression, you could search for the headings by defining a search target as "Look for an all-uppercase line that starts at the left margin and ends with a colon" and the regexp would look like:

^[A-Z]+:$

Where the ^ means beginning of the line, the $ means end of the line, the colon means a colon, the [A-Z] means "any character in the range from uppercase A to uppercase Z", and the + means one or more repeated occurrences of whatever is to the left of the + (in this case, [A-Z]).
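For what it's worth, here is a minimal sketch of that heading search in Java, using java.util.regex and the pattern ^[A-Z]+:$. The class name, method name and sample text are made up for illustration; they are not from the original report.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HeadingFinder {
    // MULTILINE makes ^ and $ match at every line boundary,
    // not just at the start and end of the whole input.
    private static final Pattern HEADING =
            Pattern.compile("^[A-Z]+:$", Pattern.MULTILINE);

    /** Returns the character offset at which each heading line starts. */
    static List<Integer> findHeadingOffsets(String report) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = HEADING.matcher(report);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        // Invented sample text, not the poster's real report.
        String report = "STATISTICS:\n1 2 3\n\n\n\nNOTES:\nnone\n";
        System.out.println(findHeadingOffsets(report));
    }
}
```

Note that [A-Z]+ only matches single-word headings. If a heading contains spaces, the character class would probably need to be widened to something like [A-Z ]+.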

Now as to how you use regular expressions in the larger sense, that depends. At the very least you can use them for the searching and tokenizing. You could even conceivably write some more complex regular expressions to do more of the work, but that's a lot to learn if you haven't done regexp before.

>As you have seen, I want to parse the above file for each and every type of
>data it has. If you see the dump summary above, it is basically a tabular
>form of data and I want to parse and extract each part of it. Well PERL is
>an option, but since I have exposure to Java, I just want to know how it
>might be done in Java. StringTokenizer with space as delimiter is of almost
>no help. Well space and newline can be used intricately, but it is much more
>difficult to parse data shown in statistics above.
>I want to display the data on webpages in some chart form.
>I hope it is clearer now. Can you provide me some help now ?

Hm, depending on what your constraints are, here's an off-the-cuff, quick and dirty approach. This assumes a few things: I can fit the whole file in memory at once, I can always count on all of the headings being there, I can always count on the whitespace being the same, etc:

1) read the whole damn file into a StringBuffer and convert it to a String.
2) use String.indexOf() to search for each of the headings specifically, and record the location of each heading.
3) go back to the StringBuffer and use StringBuffer.substring(begin, end) to clip out each section to a separate String.
4) now use StringTokenizer to parse the individual lines of each section.
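Steps 1 through 3 might be sketched roughly like this. It is only a sketch under the same assumptions as above: every heading is present, and the headings are listed in the order they appear. The class and method names are hypothetical, the UTF-8 encoding is a guess about the report file, and it clips straight from the String rather than going back to a StringBuffer.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

class SectionSplitter {
    /** Step 1: read the whole file into memory (UTF-8 is an assumption). */
    static String readAll(String fileName) throws IOException {
        return new String(Files.readAllBytes(Paths.get(fileName)), StandardCharsets.UTF_8);
    }

    /**
     * Steps 2 and 3: find each heading with indexOf() and clip out the text
     * from that heading up to the next one. Headings must be given in the
     * order they appear in the report.
     */
    static Map<String, String> split(String report, String[] headings) {
        int[] starts = new int[headings.length];
        for (int i = 0; i < headings.length; i++) {
            starts[i] = report.indexOf(headings[i]);
            if (starts[i] < 0) {
                throw new IllegalArgumentException("missing heading: " + headings[i]);
            }
        }
        Map<String, String> sections = new LinkedHashMap<>();
        for (int i = 0; i < headings.length; i++) {
            // A section runs up to the next heading, or to the end of the report.
            int end = (i + 1 < headings.length) ? starts[i + 1] : report.length();
            sections.put(headings[i], report.substring(starts[i], end));
        }
        return sections;
    }
}
```

Each section still includes its own heading and trailing whitespace at this point. One thing to watch: indexOf() finds the first occurrence, so a heading word that also shows up earlier in the body text would throw the offsets off.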

Since you're assuming the whitespace is going to be the same on each run, you can adjust the calls to substring() in step 3 to leave out the headers themselves and any leftover whitespace, etc. For example, let's say you do this on the sample data you posted.

Step 2 is going to tell you that:

FAILURE AND STRANGE etc, starts at character 0, of course,
STATISTICS: starts at character 294,
NOTES: starts at character 923,
etc.

Now you could just substring(0, 294) as the first part and then substring(294, 923) as the second part, etc. (the end index passed to substring() is exclusive, so each section ends where the next heading starts). However, you know that "FAILURE etc" is 15 characters long (including the newline). So you can substring(15, 294) and avoid having to deal with cutting that out. You also know that every section ends with three newline characters, so you can substring(15, 291) to avoid having to worry about the whitespace at the end.

Likewise, "STATISTICS:\n" is twelve characters long, but you'd like to get rid of the column headers: Total Full Daily

So if you add the length of all of that, it's 121 characters long, so you substring(294 + 121, 923 - 3) or substring(415, 920).
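The offset arithmetic can be wrapped in a small helper so you aren't juggling the numbers by hand. This is just a sketch; the class name, method name and sample text are invented, and it leans on the same assumption that the header and trailing whitespace are always the same length.

```java
class SectionTrimmer {
    /**
     * Clips the body of one section out of the report.
     * start         index where the section's heading begins
     * nextStart     index where the following heading begins
     * headerLength  characters to skip at the front (heading, column headers)
     * trailerLength characters to drop at the back (trailing newlines)
     * The end index passed to substring() is exclusive.
     */
    static String body(String report, int start, int nextStart,
                       int headerLength, int trailerLength) {
        return report.substring(start + headerLength, nextStart - trailerLength);
    }

    public static void main(String[] args) {
        // Invented sample text, not the poster's real report.
        String report = "STATISTICS:\n1 2 3\n\n\nNOTES:\nnone\n";
        int start = report.indexOf("STATISTICS:\n");
        int next = report.indexOf("NOTES:\n");
        System.out.println(body(report, start, next, "STATISTICS:\n".length(), 3));
    }
}
```

With the figures from the example, the STATISTICS section would be body(report, 294, 923, 121, 3). Using "STATISTICS:\n".length() instead of a hand-counted 12 saves you from off-by-one slips.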

Now each section consists of just the actual lines of data, and that part you can handle with StringTokenizer. There's not a whole lot more I can do here, short of writing the code for you. Why don't you try putting something together and see how it goes? You can drop me a line if you get stuck. Have fun!
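As a starting point for that last step, here is a rough sketch of using StringTokenizer twice: once to break a section body into lines, and once to break each line into fields. The class name and the sample rows in the usage note are hypothetical, and it assumes that no field contains embedded spaces.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class RowParser {
    /** Splits a section body into rows, and each row into whitespace-separated fields. */
    static List<List<String>> parse(String sectionBody) {
        List<List<String>> rows = new ArrayList<>();
        // Splitting on "\r\n" characters skips blank lines, because
        // StringTokenizer treats a run of delimiters as a single break.
        StringTokenizer lines = new StringTokenizer(sectionBody, "\r\n");
        while (lines.hasMoreTokens()) {
            // The one-argument constructor splits on whitespace.
            StringTokenizer fields = new StringTokenizer(lines.nextToken());
            List<String> row = new ArrayList<>();
            while (fields.hasMoreTokens()) {
                row.add(fields.nextToken());
            }
            if (!row.isEmpty()) {
                rows.add(row);
            }
        }
        return rows;
    }
}
```

For example, RowParser.parse("tapes 12 4 8\ndrives 3 1 2\n") gives two rows of four fields each. A row label with a space in it would be split into two fields, so tabular data like that would need fixed-column substring() calls instead.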
