Test your programming skills
The task
A frequent task in Information Retrieval (IR) is the calculation of term frequencies: for every term, count how often it occurs in a text. Here, a term is defined as the stem of a word. Examples:
word | -> | word stem (term) |
going | -> | go |
apple | -> | appl |
apples | -> | appl |
In this task the document in question (the first scene of Shakespeare's Hamlet) is in XML format, so the file must first be downloaded and parsed. Only the contents of the <LINE> elements are to be used, meaning everything enclosed in <LINE>...</LINE>. From this text the term frequencies (after stemming) are to be calculated. The output of the program is a list of all terms, together with their respective occurrence frequencies within the <LINE> elements. The output should look like this:
word | count |
go | 7 |
appl | 2 |
situat | 5 |
It is recommended to implement term counting on plain text first, and afterwards extend the program to handle XML parsing.
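The plain-text step could be sketched as follows. This is only a minimal sketch under assumptions: the class name TermCounter, the method countTerms and the tokenisation rule (lower-case everything, split on anything that is not a letter or an apostrophe) are choices made for illustration, not part of the task. The stemmer is passed in as a function; the example uses the identity function as a placeholder, so no real stemming happens until a Porter stemmer implementation is plugged in.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

public class TermCounter {

    // Counts how often each term occurs in the given plain text.
    // The stemmer is a parameter so that a real Porter stemmer can be
    // supplied later; the names here are hypothetical.
    public static Map<String, Integer> countTerms(String text, UnaryOperator<String> stemmer) {
        Map<String, Integer> counts = new HashMap<>();
        // Assumption: a word is a run of ASCII letters and apostrophes.
        String[] tokens = text.toLowerCase(Locale.ROOT).split("[^a-z']+");
        for (String token : tokens) {
            if (token.isEmpty()) {
                continue; // split yields an empty first token if the text starts with a separator
            }
            counts.merge(stemmer.apply(token), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Placeholder stemmer: identity, i.e. words are counted unstemmed.
        Map<String, Integer> counts =
                countTerms("Stand, and unfold yourself. Stand!", UnaryOperator.identity());
        // Print in the "word | count |" layout; HashMap order is unspecified.
        System.out.println("word | count |");
        counts.forEach((term, n) -> System.out.println(term + " | " + n + " |"));
    }
}
```

Passing the stemmer as an argument keeps counting and stemming separate, so the counting part can be tested on plain text before stemming and XML parsing are added.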
Feel free to solve this task in your favourite language. Further resources for solving the problem in the most important languages are below. Feel free to contact us if you have questions concerning this task. If you want us to check your results, please send us the code and the output of your running program.
Hints for Java
Parsing of XML can be done with Xerces. For stemming, a variant of the famous Porter Stemming Algorithm is available. For counting the occurrence frequencies one can use java.lang.String.split or classes from the java.util.regex package, together with java.util.HashMap.
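As an illustration of the XML step, the following sketch collects the text of all <LINE> elements. It uses the JAXP DOM API that ships with the JDK (javax.xml.parsers) instead of calling Xerces classes directly, which is a substitution made here for simplicity. The class name LineExtractor, the method extractLines and the tiny sample document in main are assumptions for illustration; the structure of the real Hamlet file may differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class LineExtractor {

    // Returns the text content of every <LINE> element in the XML string.
    public static List<String> extractLines(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("LINE");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                lines.add(nodes.item(i).getTextContent());
            }
            return lines;
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        // Made-up sample input; only the <LINE> content should be printed.
        String sample = "<SCENE><SPEECH><SPEAKER>BERNARDO</SPEAKER>"
                + "<LINE>Who's there?</LINE></SPEECH></SCENE>";
        for (String line : extractLines(sample)) {
            System.out.println(line);
        }
    }
}
```

The extracted lines can then be split into words, stemmed and counted in a java.util.HashMap as described in the hints.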