\documentclass{sig-alternate}
\usepackage{graphicx}
\usepackage{hyperref}
\usepackage{verbatim}
\usepackage{amsmath}
   
\title{Indexed Signature Slice Lists for Fast Nearest-Neighbours Search}
\setlength{\parskip}{2ex}
\begin{document}
\maketitle
\begin{abstract}
Signature Files in the Information Retrieval context are a technique used to represent documents, images and other abstract objects with binary signatures of fixed length.  The signatures have the property that they preserve in signature space the mutual topological relationships that exist in the original representation of the objects. The original representation is typically some form of very high-dimensional, often highly sparse, feature vector derived with a probabilistic, language model, or other suitable feature extraction/definition approach.  Binary signatures compress the original representation onto a lower-dimensional, dense representation, which facilitates efficient similarity computations using the Hamming distance\cite{hamming1950error} measure. This efficiency motivates most applications of signatures.  Locality Sensitive Hashing is a dimensionality reduction method based on binary signatures that is often applied to near-duplicate detection in image or text document collections, and there are numerous publications about successful applications of signature files.  While signatures offer an acceptable measure of similarity, performing a search with a given search argument (a signature) requires a Hamming distance calculation against every signature in a collection.  This quickly becomes excessive when dealing with web-sized collections, and therein lies the problem of scalability. This paper addresses the scalability problem in signature search.  We describe a new method of indexing and searching large binary signature collections to efficiently find similar signatures in very large collections.
Experimental results demonstrate that our approach is effective in finding a set of nearest signatures to a given search argument (signature) with high efficiency and high fidelity.
\end{abstract}
\category{H.3.3}{Information Search and Retrieval}{Clustering, Retrieval Models, Search Process}
\terms{Algorithms, Experimentation, Performance, Theory}
\keywords{Document Signatures, Near-Duplicate Detection}
\section{Document Signatures}
Topological signatures, as described by Geva et al.\cite{geva2011topsig}, are a refinement of Locality Sensitive Hashing and of the approach of using file signatures to represent documents, originally explored by Faloutsos and Christodoulakis\cite{faloutsos1984signature}.  These initial approaches were shown to be inferior to inverted file approaches by Zobel et al.\cite{zobel1998inverted}.  Topology preserving signatures in our work are derived from a document-term matrix, randomly projected onto a compact binary (+1,-1) subspace to produce a new, lower-order matrix (that can be efficiently stored) while preserving the topological relationships of documents in the original document-term matrix.  The geometric interpretation of this projection is that the document collection is mapped from its original representation as a term-document weight matrix onto the $\{\pm 1\}^N$ hypercube.  Each document is mapped to a vertex of the hypercube, and the projection largely preserves on the hypercube the relative pairwise distances between points in the input space.  Furthermore, Geva et al.\ describe an effective approach to performing ad-hoc keyword-based search with signatures that overcomes the difficulties identified in earlier models.  A complete description of this approach is given in \cite{geva2011topsig}, where it is demonstrated that for early precision the use of signatures for ad-hoc retrieval leads to search performance that is not statistically different from state-of-the-art models such as BM25, language models and DFR.
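To make the projection concrete, the following is a minimal sketch in C (the hash-based $\pm 1$ projection here is an illustrative stand-in, not TopSig's exact procedure, and the mixing constants are arbitrary): each signature bit takes the sign of the sparse document vector's inner product with a pseudorandom $\pm 1$ vector.
\begin{verbatim}
#include <stdint.h>
#include <string.h>

#define SIG_BITS 1024

/* Hypothetical sketch: one signature bit per pseudorandom +1/-1
   projection vector. term_ids/weights hold the sparse document
   vector; sig receives the packed 1024-bit signature. */
void make_signature(const int *term_ids, const double *weights,
                    int nterms, uint8_t sig[SIG_BITS / 8])
{
    memset(sig, 0, SIG_BITS / 8);
    for (int bit = 0; bit < SIG_BITS; bit++) {
        double dot = 0.0;
        for (int t = 0; t < nterms; t++) {
            /* Derive a deterministic +1/-1 from (term, bit) so a
               term always contributes the same sign to a bit. */
            unsigned h = (unsigned)term_ids[t] * 2654435761u
                         ^ (unsigned)bit;
            h ^= h >> 13; h *= 2246822519u; h ^= h >> 16;
            dot += (h & 1) ? weights[t] : -weights[t];
        }
        if (dot > 0)
            sig[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}
\end{verbatim}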
   
For the sake of completeness, we briefly describe some details of that file signature approach.  The original Faloutsos et al.\ approach is essentially a Bloom filter, although it is not explicitly described as such in the original paper\cite{faloutsos1984signature}. Terms in a document are hashed to create pseudorandom sparse bit strings that are combined (through a bitwise OR operation) to create a document signature. Queries are then tested by comparing the signatures of terms in the query (derived in the same manner) with the document signatures. If all the set bits in the query signature are also set in a document signature, there is a possibility that the terms appear in the document.
Since it is possible for these term bit strings to combine to create a false positive for a term not present in the document, the document must be scanned to ensure that the terms in the query are actually present. This is characteristic of Bloom filters, which are highly efficient but do produce false positives. The rate of false positives can be reduced by increasing the size of the bit strings at the expense of a greater storage requirement.
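As a minimal sketch of this scheme (the FNV-1a hash and the choice of 8 bits per term are placeholder assumptions, not parameters from \cite{faloutsos1984signature}): each term sets a few pseudorandom bits, the document signature is the OR of the term signatures, and a query term is ``possibly present'' only if all of its bits are set.
\begin{verbatim}
#include <stdint.h>

#define SIG_WORDS 16        /* 16 x 64 = 1024-bit signature */
#define BITS_PER_TERM 8     /* placeholder choice */

static uint64_t hash64(const char *s, int salt)
{
    uint64_t h = 1469598103934665603ull ^ (uint64_t)salt;
    while (*s) { h ^= (uint64_t)*s++; h *= 1099511628211ull; }
    return h;               /* FNV-1a with a salt per bit */
}

/* OR the term's pseudorandom bits into the document signature. */
void add_term(uint64_t sig[SIG_WORDS], const char *term)
{
    for (int i = 0; i < BITS_PER_TERM; i++) {
        uint64_t b = hash64(term, i) % (SIG_WORDS * 64);
        sig[b / 64] |= 1ull << (b % 64);
    }
}

/* Bloom-filter test: 1 = possibly present, 0 = definitely absent. */
int maybe_contains(const uint64_t sig[SIG_WORDS], const char *term)
{
    for (int i = 0; i < BITS_PER_TERM; i++) {
        uint64_t b = hash64(term, i) % (SIG_WORDS * 64);
        if (!(sig[b / 64] & (1ull << (b % 64))))
            return 0;
    }
    return 1;
}
\end{verbatim}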
   
Signatures generated with this method can be used for information retrieval but have significant drawbacks, including the necessity of storing the terms present in the document to avoid false positives; the cost of eliminating those false positives becomes excessive. They also require extensive amounts of storage space to ensure signatures are long enough to keep the occurrence of false positives low.
   
Locality Sensitive Hashing, as implemented and refined in the TopSig approach of \cite{geva2011topsig}, fixes most of the problems identified with document signatures in terms of retrieval performance.  However, one problem remained unsolved in that paper: the scalability of the approach is limited while it relies on sequential, exhaustive Hamming distance calculations. The approach requires performing a Hamming distance calculation against every signature in the collection, which quickly becomes prohibitive as collections grow. An approach is therefore needed that avoids performing this calculation against every signature while still retrieving the nearest signatures.
   
%**** NEED LITERATURE REVIEW AND REFERENCES TO OTHER APPROACHES TO APPROXIMATE NEAREST NEIGHBOUR SEARCH. THIS IS NOT A NEW PROBLEM. MUST HAVE PRIOR WORK REFERENCED.
   
\section{Indexed Signature Slice Lists}
   
Indexed Signature Slice Lists (ISSLs) are a hybrid approach that combines the inverted file approach with the signature file approach. The purpose of an ISSL file is to provide an index into a signature file that allows searching for signatures close (in terms of Hamming distance) to a given search signature.  In a normal inverted file scheme the postings correspond to terms in the document collection, and the search process combines those postings for a given set of query terms to produce a ranked list of results.  In an analogous manner, ISSLs are posting lists where signature slice values correspond to terms and the postings correspond to signatures in the collection.  The signature slices of a query signature are used to combine the respective ISSL postings during search to produce a ranked list of signatures.
   
The creation of ISSLs proceeds as follows.  We start from a set of binary signatures.  We will focus the discussion on 1024-bit signatures, but other signature sizes are possible and indeed used.  We treat each signature as a sequence of 16-bit slices, viewing the signature as a set of 64 16-bit integers.  For efficiency reasons related to processor and software architecture considerations, 16-bit slices work well, but slices of other sizes are possible.  We find that this choice is adequate even as we work with web-scale collections.  The ISSL index maintains a posting list for each combination of slice position and each of the 65,536 possible 16-bit integer values; hence there are $64 \times 65{,}536 = 4{,}194{,}304$ posting lists.  When indexing a signature, each of its 64 slices contributes one posting: the slice position and the integer value of the slice together identify the posting list, and the signature ID is appended to it.
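A sketch of this indexing step under the stated parameters (the growable-array posting list here is a simplification, not necessarily the layout used by our tool):
\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

#define SLICES 64           /* 1024-bit signatures, 16-bit slices */
#define SLICE_VALUES 65536

typedef struct { uint32_t *ids; uint32_t n, cap; } List;

/* SLICES x 65536 = 4,194,304 posting lists. */
static List lists[SLICES][SLICE_VALUES];

static void list_push(List *l, uint32_t id)
{
    if (l->n == l->cap) {
        l->cap = l->cap ? l->cap * 2 : 4;
        l->ids = realloc(l->ids, l->cap * sizeof(uint32_t));
    }
    l->ids[l->n++] = id;
}

/* Each signature contributes one posting per slice position; the
   (position, slice value) pair identifies the posting list. */
void index_signature(const uint16_t slices[SLICES], uint32_t sig_id)
{
    for (int p = 0; p < SLICES; p++)
        list_push(&lists[p][slices[p]], sig_id);
}
\end{verbatim}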
   
When searching an ISSL file for a search signature, the search signature is divided into slices in the same way and each slice is used to look up an ISSL. All the signature IDs listed in the retrieved ISSLs are consolidated.  Trivially, when searching for a specific signature that was previously indexed, we would find its signature ID in each of the posting lists corresponding to its 64 slices.  In general, however, we are looking for nearest signatures, which are highly unlikely to be identical.  This means that we need to be able to compute the Hamming distance of signatures from the search signature by using the ISSL postings.  For instance, a signature may not appear in any of the 64 postings that correspond to the query signature, yet be exactly one bit away in each of those slices.  That would amount to a Hamming distance of 64, and it may well be the case that this is the nearest signature.  We therefore need to process the postings in a different manner to the way we do with conventional term-based search.
   
We need to consider not only the exactly matching ISSL postings, but also their neighbourhood.  With 16-bit slices, there is 1 matching slice value, 16 values that are 1 bit away, 120 values that are 2 bits away, and so on.  The search is therefore repeated with the search slices progressively altered to cover a wider Hamming neighbourhood.  As the lookup of postings progresses, we tally the worst-case Hamming distance of signatures encountered in postings.  Initially all signatures are assumed to be at a distance of 1024.  As the search progresses, any signature ID that has not yet been encountered in any of the postings still has a worst-case Hamming distance of 1024, while a signature ID that is observed in a particular posting immediately receives a more optimistic estimate of its distance.  For instance, if a signature is observed in a slice at a distance of 3 bits from the corresponding search slice, then its worst-case Hamming distance is reduced by 13.  Most signatures near the search argument are observed sooner rather than later in all slice positions, and their distances become known with full precision.  Of course, if the process is carried through until all ISSL postings are consulted, then the distance of every signature from the search argument is precisely known.
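To state the estimate explicitly (in our notation): if slice position $i$ of a candidate signature has been observed at slice distance $e_i$ from the corresponding search slice, its contribution to the Hamming distance is known exactly, while an unobserved slice can contribute at most 16 bits. With $O$ the set of observed slice positions, the worst-case distance at any point during the search is
\begin{equation*}
d_{\mathrm{wc}} \;=\; \sum_{i \in O} e_i \;+\; 16\,(64 - |O|) \;=\; 1024 - \sum_{i \in O} \left(16 - e_i\right),
\end{equation*}
so each observation at slice distance $e_i$ tightens the estimate by $16 - e_i$; in the example above, $e_i = 3$ gives the reduction of 13.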
   
This brings us to the key idea in this paper: we conjecture, and test the conjecture, that it is not necessary to proceed with the calculation until complete information is available in order to determine with high accuracy the neighbourhood of a search signature.  That is, we can stop the calculation early and use the worst-case distance estimates to choose the candidate list of nearest neighbours.  The desired length of the list of nearest neighbours is more often than not very short (it does not usually depend on the collection size, but rather on the users' capacity to review long result lists).  Therefore, once we have identified the candidate result list, we can afford an accurate Hamming distance calculation and a final sort to produce our final ranked result list.
   
The number of posting lists consulted is of course a function of the search-breadth in bits.  Table~\ref{tab:listschecked} provides the number of ISSL postings that have to be consulted as a function of search-breadth.  It also contains the percentage of distinct signature IDs encountered while processing the lists.  These figures correspond to the Wall Street Journal collection of document signatures.
   
   
For 16-bit slices, widening the search by one bit of error means an extra 16 lookups per slice; repeating this for every 2-bit error, every 3-bit error and so forth gives $\displaystyle \sum_{i=0}^{k} \binom{n}{i}$ lists checked per slice when allowing for up to $k$ bits of error in a search with $n$-bit slices.
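The second column of Table~\ref{tab:listschecked} follows directly from this sum; as a sanity check, this minimal program reproduces it:
\begin{verbatim}
#include <stdio.h>

/* Reproduce the 'Lists checked' column of Table 1:
   the sum of C(16, i) for i = 0..k. */
int main(void)
{
    unsigned long long c = 1, total = 0;   /* c = C(16, i) */
    for (int i = 0; i <= 16; i++) {
        total += c;
        printf("%2d %8llu\n", i, total);   /* 1, 17, 137, 697, ... */
        c = c * (16 - i) / (i + 1);        /* C(16, i+1), exactly */
    }
    return 0;
}
\end{verbatim}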
\begin{table}
\begin{center}
\begin{tabular}{ | c | l | l | }
\hline
Bits changed & Lists checked & \% of signatures \\
\hline
0 & 1 & 0.62\% \\
1 & 17 & 7.13\% \\
2 & 137 & 34.76\% \\
3 & 697 & 79.62\% \\
4 & 2517 & 98.93\% \\
5 & 6885 & 100.00\% \\
6 & 14893 & 100.00\% \\
7 & 26333 & 100.00\% \\
8 & 39203 & 100.00\% \\
9 & 50643 & 100.00\% \\
10 & 58651 & 100.00\% \\
11 & 63019 & 100.00\% \\
12 & 64839 & 100.00\% \\
13 & 65399 & 100.00\% \\
14 & 65519 & 100.00\% \\
15 & 65535 & 100.00\% \\
16 & 65536 & 100.00\% \\
\hline
\end{tabular}
\end{center}
\caption[]{Number of slice lists checked based on the number of bits toggled in the search slice. The third column shows the average portion of the collection covered by these lists\footnotemark[1].}
\label{tab:listschecked}
\end{table}
   
\footnotetext[1]{Tested on the Wall Street Journal collection indexed by TopSig, averaged over 10 document similarity searches; the 10 query documents were chosen with the first 10 5-digit numbers from \emph{A Million Random Digits with 100,000 Normal Deviates}.}
  % This is a footnote because the table description would be long and unwieldy otherwise  
Each signature is assigned a score for the purpose of this search. Each score starts at 0 and is incremented by the difference between the slice width and the Hamming distance between the search slice and the slice associated with the list in which the signature appeared. When the maximum search-breadth (the number of bits that may differ between the search slice and the signature slice) is equal to the slice width, the score for each signature will equal the difference between the signature width and the Hamming distance. When the maximum search-breadth is lower, the amount of processing time required is reduced, as fewer lists need to be considered. Assuming an equal distribution of signatures per list\footnotemark[2], allowing a maximum search-breadth of 3 bits while using 16-bit slices requires only 1.06\% of the work that a maximum search-breadth of 16 would require, but yields potentially less accurate scores.
   
\footnotetext[2]{Not necessarily a valid assumption for real data. For example, TopSig signatures typically contain more off-bits than on-bits, skewing the signature distribution towards the lists that cover slices with fewer bits set. This is due to off-bits being the default state of bits in a TopSig signature; any bits unaffected by any of the terms in a document will remain in this state. Bits that are affected have an equal chance of being either on-bits or off-bits, so the presence of this default state results in more off-bits than on-bits.}
After the scores for all signatures have been determined, the highest scoring signatures can be reranked using a full Hamming distance calculation to ensure that the ranking of these signatures is correct. At higher levels of maximum search-breadth, more processing time is required, but the likelihood of documents that should appear in the top-K being omitted is reduced. There are other speed-accuracy tradeoffs available; increasing the number of signatures that are reranked using the Hamming distance may result in fewer top documents being omitted at the cost of a linear increase in processing time.
%Commented these out for now. Need to work out what to do with them, but they don't even belong here.
%\begin{figure}
%\begin{center}
%\fbox{\includegraphics[width=0.45\textwidth]{graphics/top10-match.pdf}}
%\end{center}
%\caption{Out of the top 10 results returned, the number of results that match the actual top 10 results (ordered by Hamming distance)}
%\end{figure}
%\begin{figure}
%\begin{center}
%\fbox{\includegraphics[width=0.45\textwidth]{graphics/top100-match.pdf}}
%\end{center}
%\caption{Out of the top 10 results returned, the number of results that match the actual top 100 results (ordered by Hamming distance)}
%\end{figure}
   
\section{Implementation}
   
The ISSL search platform was developed on top of the open source signature search tool TopSig and is now a part of that tool. The tool was developed in C using pthreads for multithreading support.
   
The platform was developed with a focus on 16-bit ISSL slices. As the main justification for ISSLs is performance-related, slice widths that are not powers of 2 were not considered. Both 8-bit and 32-bit slices were deemed inappropriate: 8-bit slices would not substantially reduce the amount of processing work required, while 32-bit slices would impose too much overhead, making the approach worthless for all but the largest of collections.
   
Due to the performance focus of ISSL, the approach described was designed to work entirely within memory while building the ISSL table and while searching it. While it would be possible to create an implementation that works from an ISSL table stored on disk without reading it into memory first, this approach would take careful consideration as many random accesses of the table are required.
   
Building the ISSL table was designed as a two-pass process; the first pass determines how large each of the posting lists is, for the purpose of memory allocation, and the second pass fills them. Each list is stored as an array of 32-bit integers that uniquely identify signatures within a signature file. The $s \times 2^{w}$ lists\footnotemark[3], each along with an integer giving the number of signatures in the list, are written to a file for reading by the ISSL search tool.
   
\footnotetext[3]{$s$ = number of slices in a signature. $w$ = slice width (in bits). For example, a 1024-bit signature with 16-bit slices will have 1024/16 or 64 slices per signature. ($s$ = 64, $w$ = 16)}
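A sketch of the two-pass build (the hypothetical \texttt{slice\_of} callback stands in for extracting the $p$-th slice of signature $i$; the in-memory layout is simplified relative to the on-disk format described above):
\begin{verbatim}
#include <stdint.h>
#include <stdlib.h>

typedef struct { uint32_t count; uint32_t *ids; } List;

/* s = slices per signature, v = 2^w possible slice values. */
List *build_issl(uint32_t nsigs, int s, uint32_t v,
                 uint16_t (*slice_of)(uint32_t sig, int pos))
{
    List *lists = calloc((size_t)s * v, sizeof(List));
    /* Pass 1: count the postings destined for each list. */
    for (uint32_t i = 0; i < nsigs; i++)
        for (int p = 0; p < s; p++)
            lists[(size_t)p * v + slice_of(i, p)].count++;
    /* Allocate each list exactly; reuse count as a write cursor. */
    for (size_t l = 0; l < (size_t)s * v; l++) {
        lists[l].ids = malloc(lists[l].count * sizeof(uint32_t));
        lists[l].count = 0;
    }
    /* Pass 2: fill the lists with signature IDs. */
    for (uint32_t i = 0; i < nsigs; i++)
        for (int p = 0; p < s; p++) {
            List *l = &lists[(size_t)p * v + slice_of(i, p)];
            l->ids[l->count++] = i;
        }
    return lists;
}
\end{verbatim}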
   
The ISSL table can be read quickly if the on-disk representation and in-memory representation of the arrays of integers is the same. This will almost always be the case when the ISSL table is written and read on the same machine but may not be the case if the ISSL table is moved between machines of different endianness. The tool was designed to recognise this and fall back on a slower approach if necessary.
   
The amount of space taken up by the ISSL table scales linearly with both signature width and collection size. While a collection of 1,000,000 1024-bit signatures will take up $\displaystyle \frac{1000000 \times 1024}{8}$ bytes or $\sim{122}$ megabytes of space\footnotemark[4], the ISSL table for this same collection will require $4(1000000 \times 64 + 2^{16} \times 64)$ bytes, or $\sim{260}$ megabytes.
   
\footnotetext[4]{Plus overhead; TopSig imposes a mandatory overhead of $33 + \ell$ bytes per signature, where $\ell$ is the maximum length of any document identifier. This overhead can potentially be considerable.}
   
The potentially large sizes of the ISSL table and collection can impose limitations on the collections that can be used with ISSL, depending on the hardware available, as both the signatures and the table are stored in memory\footnotemark[5].
   
\footnotetext[5]{This implementation of ISSL stores both in memory. This is not necessarily required; for example, if a limited number of searches are going to be performed on a particular collection the scoring phase and the reranking phase can be separated with the intermediate results stored in memory or on disk. This implementation of ISSL was designed to create quick responses to individual queries while keeping all information in memory, therefore necessitating the storage of both signatures and ISSLs in memory.}
   
The third data structure that uses a considerable amount of memory is the score table, which needs one element per signature. For simplicity, 32-bit integers were used in this implementation, but 16-bit integers suffice if the signature width is lower than $2^{16}$. This structure still ends up much smaller than the other two, at only $\sim{4}$ megabytes for our example 1,000,000-signature collection. The score table is necessary because the score of any signature may be increased by any of the posting lists; it is not feasible to keep a top-K list or similar structure that only contains the highest scoring signatures. As Table~\ref{tab:listschecked} shows, most of the scores are touched even at relatively low search-breadth thresholds.
   
The ISSL search tool begins a search for a particular signature by resetting the score table to 0 and iterating through the search signature, one 16-bit slice at a time. For each slice, the Hamming neighbourhood array\footnotemark[6] is iterated through until the allowable search-breadth threshold is met. For example, if the allowable search-breadth threshold is set at 4, the first 2517 values (see Table~\ref{tab:listschecked}) of the Hamming neighbourhood array will be iterated through in the process of scoring documents. These 2517 values comprise all possible 16-bit values with between 0 and 4 on-bits. The current value is combined with the search slice by exclusive-or, and the result, combined with the position of the slice within the signature, identifies one list in the ISSL table. This list is then iterated through, and each signature that appears in the list has its score incremented by $16-n$, where $n$ is the number of on-bits in the current value of the Hamming neighbourhood array.
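A sketch of this scoring loop (\texttt{nbr} is the precomputed Hamming neighbourhood array discussed in the footnote below; \texttt{list\_for} is a hypothetical accessor returning the posting list for a (position, value) pair, not a function of the actual tool):
\begin{verbatim}
#include <stdint.h>

/* Hypothetical accessor: the posting list for (slice position,
   slice value); stores the list length in *n. */
const uint32_t *list_for(int pos, uint16_t value, uint32_t *n);

/* Score every signature against one 64-slice search signature,
   visiting lists in order of increasing slice distance, up to
   'breadth' toggled bits. scores[] has one zeroed entry per
   signature in the collection. */
void score_signature(const uint16_t query[64],
                     const uint16_t *nbr, uint32_t nnbr,
                     int breadth, uint32_t *scores)
{
    for (uint32_t j = 0; j < nnbr; j++) {
        int dist = __builtin_popcount(nbr[j]); /* bits toggled */
        if (dist > breadth)
            break;                  /* nbr is sorted by popcount */
        for (int p = 0; p < 64; p++) {
            uint32_t n;
            const uint32_t *ids = list_for(p, query[p] ^ nbr[j], &n);
            for (uint32_t k = 0; k < n; k++)
                scores[ids[k]] += (uint32_t)(16 - dist);
        }
    }
}
\end{verbatim}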
   
\footnotetext[6]{To simplify determining the permutations of bits that may be toggled when expanding the search neighbourhood, the ISSL search tool precomputes a Hamming neighbourhood array of all possible 16-bit values, sorted by the number of on-bits in each value. A search slice can be combined with these values with an exclusive-or operation to produce the required altered slices: combining the search slice with the first element reproduces the original slice, while the last element produces its inverse. The intermediate values produce all permutations of the slice with a given number of flipped bits, allowing them to be enumerated with little further computation.}
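A sketch of that precomputation (simple popcount bucketing; any bit-count routine will do in place of the GCC builtin):
\begin{verbatim}
#include <stdint.h>

/* Fill nbr[0..65535] with every 16-bit value, ordered by popcount:
   nbr[0] = 0x0000, then the 16 one-bit values, the 120 two-bit
   values, and so on up to nbr[65535] = 0xFFFF. */
void build_neighbourhood(uint16_t nbr[65536])
{
    uint32_t pos = 0;
    for (int bits = 0; bits <= 16; bits++)
        for (uint32_t v = 0; v < 65536; v++)
            if (__builtin_popcount(v) == bits)
                nbr[pos++] = (uint16_t)v;
}
\end{verbatim}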
   
After this process is complete, the score table will contain scores for each signature. The top-K scores can now be extracted from the array using a heap or similar structure\footnotemark[7]. The top-K signatures then have their scores recomputed with a full Hamming distance calculation and are sorted to produce the final results.
   
\footnotetext[7]{In this implementation the top-K scores are extracted from the array using a K-sized array that holds the K-highest scoring signatures seen so far and replaces the signature with the lowest score when a signature with a higher score is seen.}
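A sketch of this selection step (the linear-scan variant from the footnote; a heap becomes preferable as $K$ grows):
\begin{verbatim}
#include <stdint.h>

/* Keep the K highest-scoring signature IDs from scores[0..n-1]
   in top[], replacing the current minimum whenever a higher
   score is seen. O(n*K) worst case, acceptable for small K. */
void top_k(const uint32_t *scores, uint32_t n, uint32_t *top, int k)
{
    int filled = 0;
    for (uint32_t i = 0; i < n; i++) {
        if (filled < k) {
            top[filled++] = i;
            continue;
        }
        int lowest = 0;             /* find the weakest entry */
        for (int j = 1; j < k; j++)
            if (scores[top[j]] < scores[top[lowest]])
                lowest = j;
        if (scores[i] > scores[top[lowest]])
            top[lowest] = i;
    }
}
\end{verbatim}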
   
\section{Performance Considerations and Scalability}
   
ISSL searches demonstrate an improvement in computational performance over calculating the Hamming distance against every signature in a collection; however, computational time and memory use still scale linearly with collection size, causing problems when dealing with large collections. Like basic signature searching, though, this approach is inherently conducive to parallel processing.
   
The TopSig implementation of ISSL developed for this study implements basic support for parallel processing using the pthreads library. The user can specify, via configuration file or command line, the number of threads that should be used for searching. The threads are created for each search, and each is assigned an approximately equal portion of the posting lists in the ISSL table. Each thread processes its assigned lists, adding scores to the score table with atomic operations\footnotemark[8]. When all threads have terminated, the main thread continues as normal, performing the top-K extraction and Hamming distance calculations without using extra threads. This approach is efficient as it can be implemented without locks.
   
\footnotetext[8]{\_\_sync\_fetch\_and\_add(), which is portable across Intel architectures.}
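With each thread scoring a disjoint range of posting lists, the only shared writes are the score increments, so the update reduces to a single atomic add per posting. For illustration (a sketch; the surrounding thread setup is omitted):
\begin{verbatim}
#include <stdint.h>

/* Inner update performed by each worker thread over its share of
   the posting lists; the score table is shared, so the increment
   uses the GCC atomic builtin named above. No locks required. */
static void add_scores(const uint32_t *ids, uint32_t n, int dist,
                       uint32_t *scores)
{
    for (uint32_t k = 0; k < n; k++)
        __sync_fetch_and_add(&scores[ids[k]],
                             (uint32_t)(16 - dist));
}
\end{verbatim}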
   
The process of extracting the top-K elements from a list can also be parallelised and this may be beneficial for searching larger collections where this operation may consume a greater portion of the total processing time. Each thread can perform top-K extraction on separate portions of the list and the top-K lists can be merged at the end.
   
These approaches allow for performance improvements across a system with shared memory. For larger processing tasks, it may be useful to split the workload across multiple systems with independent memory. Signatures are also conducive to this task, as a signature file can be split into multiple parts and a separate ISSL created for each, dividing the memory and processor burden across many machines. The final results can then be merged together, provided the scores are preserved.
   
Performance approaches may vary depending on whether the focus is on improving the performance of individual queries, as would be desirable in an interactive system, or on reducing the amount of time required to process a large batch of queries, as may be desirable for cluster generation. While the previously discussed techniques are geared towards optimising the former, for the latter dividing up the queries among threads may be more effective than dividing up the collection. The tradeoff made involves increasing the latency between submitting a query and retrieving its results in exchange for reducing the overall processing time required.
   
\section{Experimental Results}
\section{Conclusion}
\bibliographystyle{plain}
\bibliography{bibliography}
\end{document}