Test your programming skills

The task

A frequent task in Information Retrieval (IR) is the calculation of term frequencies: for every term, count how often it occurs in a text. Here, a term is defined as the stem of a word. Examples:

word -> word stem (term)
going -> go
apple -> appl
apples -> appl

In this task the document in question (the first scene of Shakespeare's Hamlet) is in XML format, so the file must first be downloaded and parsed. Only the contents of the <LINE> elements are to be used, meaning everything enclosed in <LINE>...</LINE>. From this text the term frequencies (after stemming) are to be calculated. The output of the program is a list of all terms together with their occurrence frequencies within <LINE> elements. The output should look like this:

word count
go 7
appl 2
situat 5

It is recommended to implement term counting on plain text first and afterwards extend the program to XML parsing.
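
A minimal sketch of the plain-text step in Java might look like the following. The class name TermCounter, the method countTerms, and the tokenization rule (lowercase, split on anything that is not a letter a-z) are assumptions made for illustration and are not part of the task description. The stemmer is passed in as a function so that a Porter stemmer can be plugged in later; the example call uses the identity function, so no stemming takes place yet.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TermCounter {

    // Counts how often each term occurs in the given plain text.
    // The stemmer is supplied by the caller; a full solution would pass
    // in a Porter stemmer here (hypothetical, not implemented in this sketch).
    public static Map<String, Integer> countTerms(String text, UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumed tokenization: lowercase, then split on runs of non-letters.
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token if the text starts with a separator
            }
            counts.merge(stemmer.apply(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Identity stemmer: terms are the lowercased words themselves.
        Map<String, Integer> counts =
                countTerms("Stand, and unfold yourself. Stand!", UnaryOperator.identity());
        System.out.println("word count");
        counts.forEach((term, n) -> System.out.println(term + " " + n));
    }
}
```

Because a HashMap is used, the order of the printed terms is unspecified; a java.util.TreeMap would give alphabetical order if that is preferred.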

Feel free to complete this task in your favourite language. Further resources for solving the problem in the most important languages are below. Feel free to contact us if you have questions concerning this task. If you would like us to check your results, please send us the code and the output of your running program.

Hints for Java

XML can be parsed with Xerces. For stemming, a variant of the well-known Porter Stemming Algorithm is available. To count the occurrence frequencies, one can use java.lang.String.split or classes from the java.util.regex package together with java.util.HashMap.
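
As a rough sketch of the XML step, the following uses the JAXP DOM API that ships with the JDK (Xerces implements the same API) to collect the text of every <LINE> element. The class name LineExtractor, the method extractLines, and the small XML snippet in main are assumptions for illustration; the real file has to be downloaded and may be structured differently around its <LINE> elements.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

public class LineExtractor {

    // Returns the text content of every <LINE> element, in document order.
    public static List<String> extractLines(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("LINE");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                lines.add(nodes.item(i).getTextContent());
            }
            return lines;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample input; only the <LINE> contents are printed.
        String xml = "<SCENE><SPEECH><SPEAKER>BERNARDO</SPEAKER>"
                + "<LINE>Who's there?</LINE></SPEECH>"
                + "<SPEECH><SPEAKER>FRANCISCO</SPEAKER>"
                + "<LINE>Nay, answer me: stand, and unfold yourself.</LINE></SPEECH></SCENE>";
        extractLines(xml).forEach(System.out::println);
    }
}
```

The extracted lines can then be joined and passed to whatever term-counting routine was written for plain text, so the XML handling stays separate from the counting logic.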