IP19 Report template
Author
Troels Bjerre Lund and Martin Aumüller
Last Updated
5 years ago
License
Creative Commons CC BY 4.0
Abstract
Template for exam project in Introductory Programming at the IT University of Copenhagen for the fall of 2019
Template for exam project in Introductory Programming at the IT University of Copenhagen for the fall of 2019
\documentclass[11pt]{article}
% Character set for input and output
\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}
% Fonts
\usepackage{libertine}
\usepackage[scaled=0.83]{beramono}
% AMS math-packages
\usepackage{amssymb}
\usepackage{amsmath,amsthm}
% TODO in text
\usepackage{todonotes}
% URL (clickable), references within document, etc./
\usepackage{hyperref}
% Code formatting
\usepackage{listings}
\lstset{
language=java,
extendedchars=true,
basicstyle=\footnotesize\ttfamily,
showstringspaces=false,
showspaces=false,
numbers=left,
numberstyle=\footnotesize,
numbersep=9pt,
tabsize=2,
breaklines=true,
showtabs=false,
frame=single,
extendedchars=false,
inputencoding=utf8,
captionpos=b
}
% Text / Paragraph space
\addtolength{\textwidth}{1.5cm}
\addtolength{\hoffset}{-0.5cm}
\setlength{\parindent}{0pt}
\setlength{\parskip}{1.5ex plus 1ex minus 1ex}
\title{Project: Search Engine}
\author{{\Large Group Z}\\Alice Liddell \quad Bob Ross \quad Cruella de Vil \quad David Attenborough\\\tt{\{alid,bros,cdev,datt\}@itu.dk}}
\date{\today{}}
%\date{December 14, 2019}
\begin{document}
\maketitle
\section*{Introduction}
%
This document reports on the search engine project that we developed
during the Introductory Programming course at the IT University of
Copenhagen.
%
In Sections~1--3 we will report on our solutions to the
mandatory tasks posed in the project description.
%
We also solved some of the challenges posed for these mandatory
tasks.
%
Their solution is described in the respective section along with the
solution to the mandatory task.
%
\todo{This is just an example template; you do not need to do these extensions.}
%
Furthermore, we developed the following extensions which are described
in detail in the following sections:
%
\begin{itemize}
\item
Section 4 describes our changes to the client GUI.
\item
Section 5 describes how we built a webcrawler.
\item
Section 6 describes how we supported the ``fuzzy search'' feature.
\end{itemize}
The description for each solution is roughly split up into the
following parts:
%
\begin{itemize}
\item
\textbf{Task:} A short review on the task that we had to solve.
\item
\textbf{Approach:} An informal, high-level description of how we solved the
task.
\item
\textbf{Solution:} A detailed technical description of our solution to the task.
\item
\textbf{Test:} Description of considerations for testing the
correctness of our solution.
\end{itemize}
%
Other parts (e.g. design, reflection, \ldots{}) will be included where
deemed appropriate. For instance, Task 1 includes a part on the
benchmarking results.
%
The source code of our project is handed in as a single
zip file called \texttt{<filename>.zip}.
%
\texttt{<insert directory>} contains the source code that solves the
mandatory tasks.
%
Our code is also available on ITU's Github:
\url{https://github.itu.dk/...}.
%
\subsection*{Statement of Contribution}
%
All authors contributed equally to all parts of the solution of the
mandatory tasks.
%
Bob Ross made the graphic design for the client GUI.
%
Bob Ross and Cruella de Vil conceived the idea of the web crawler,
Cruella de Vil implemented it, and Bob Ross documented it.
%
Alice Liddell is solely responsible for the ``fuzzy search'' feature.
\section{Inverted Indices}
\paragraph{Task}
%
This task involves reducing the amount of time the search engine takes
to respond to a query, by means of an inverted index.
%
%
An inverted index provides a fast way to access the webpages that
contain a certain word by storing a mapping from words to the list
of webpages that contain the word.
%
Figure~\ref{fig:index:uml} provides an example for such a data structure.
%
We are to implement two variants of an inverted index: one based on a
{\tt TreeMap}, and the other on a {\tt HashMap}. We are then to
evaluate the performance of these implementations, and to pick the
best one to use in our search engine.
%
\paragraph{Approach}
%
As per the code handout, we generalized the notion of an index into what we call an
{\tt Index}, that has the following two methods:
%
\begin{itemize}
\item {\tt build}: add description from your javadoc.
\item {\tt lookup}: add description from your javadoc.
\end{itemize}
%
We then implemented the inverted index by using Java's Map data structure.
%
Here, Java provides different implementations and we tested the
following variants in our code: ..., ....
\paragraph{Solution} We build the index data structures as described in Figure~\ref{fig:index:uml}.
We allowed to switch the implementation of the Map data structure by ... . Building the inverted index was accomplished by the following code:
\begin{lstlisting}[language=Java]
for each webpage page in list of webpages {
for each word p on page {
...
}
}
\end{lstlisting}
\begin{figure}[t]
\centering
\includegraphics[width=\textwidth]{figures/uml}
\caption{UML Diagram for the Software Architecture of our webserver.}
\label{fig:index:uml}
\end{figure}
\paragraph{Test} We verified the correctness of the {\tt build} and {\tt lookup} method with unit tests. These tests can be found in the file .... Here, the {\tt build} method of the inverted index was tested by .... (May add Java code here.)
The {\tt lookup} method of all three indexes was checked by carefully creating a list of test webpages and checking whether the lookup methods returns the expected results by looking at the following properties of the returned list: (i) its size and (ii) ... ? (Add Java code?)
\paragraph{Benchmark}
%
We evaluated which implementation provides the fastest running time for
lookup operation by ... (describe your benchmarking setup / approach).
%
We choose the following query words for the benchmark:
%
``itu, is, the, best''.
%
The running times of different index implementations can be found in
Table~\ref{tab:benchmark:indices} (or a plot).
%
\begin{table}[t]
\begin{tabular}{l|l|l}
Index & File & Avg. lookup time (ms)\\ \hline \hline
SimpleIndex & enwiki-tiny.txt & ... \\
SimpleIndex & enwiki-small.txt & ... \\
SimpleIndex & enwiki-medium.txt & ... \\
InvertedIndex w/ HashMap & enwiki-tiny.txt & ... \\
... & ... & ... \\
InvertedIndex w/ TreeMap & enwiki-tiny.txt & ... \\
... & ... & ...
\end{tabular}
\caption{Running times for different index implementations.}
\label{tab:benchmark:indices}
\end{table}
%
These benchmark results indicate that ... .
%
Based on these results, we decided that we use ... as the index in our
data structure.
\section{Refined Queries}
\section{Ranking Algorithms}
\section{Extension: Improving the Client GUI}
\section{Extension: Webcrawler}
\section{Extension: Fuzzy-Search}
\end{document}