
Task Details

N-grams of words are often used in natural language processing and information extraction tasks. An N-gram is a contiguous sequence of N words. For example, if we have the phrase "the book is on the table" and we want to extract all N-grams with N=3, the N-grams are:

  • the book is
  • book is on
  • is on the
  • on the table
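
As a minimal sketch of this extraction (assuming, purely for illustration, that words are separated by whitespace), the N-grams of a phrase could be enumerated in Python as follows; the function name extract_ngrams is our own choice and not part of any required interface:

    def extract_ngrams(text, n):
        # Split on whitespace and emit every contiguous run of n words.
        words = text.split()
        return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

    print(extract_ngrams("the book is on the table", 3))
    # ['the book is', 'book is on', 'is on the', 'on the table']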

In this contest, the task is to search documents and return strings from a given set, as quickly as possible. Each string represents an N-gram. We will provide an initial set of strings which you may process and index. Once this is done, we will begin issuing a workload consisting of a series of queries (documents) and N-gram updates (insertions or deletions), arbitrarily interleaved. Each N-gram insertion or deletion updates the list of N-grams of interest accordingly. For each arriving query (document), the task is to return, as fast as possible, the N-grams of the current list that are found in the document. These must be presented in order of their first appearance in the document. If one N-gram is a prefix of another and the longer one occurs in the document, the shorter one is presented first.
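
To make the required ordering concrete, the following deliberately naive Python sketch matches a set of N-grams against a document and orders the results as described above (word-aligned matching and whitespace separation are assumed; answer_query and first_word_match are hypothetical helper names, not part of the protocol):

    def first_word_match(doc_words, gram_words):
        # Index of the first word-aligned occurrence of gram_words in doc_words, or None.
        n = len(gram_words)
        for i in range(len(doc_words) - n + 1):
            if doc_words[i:i + n] == gram_words:
                return i
        return None

    def answer_query(ngrams, document):
        doc_words = document.split()
        hits = []
        for gram in ngrams:
            pos = first_word_match(doc_words, gram.split())
            if pos is not None:
                # Sorting by (first position, length) places a prefix before any
                # longer N-gram whose match starts at the same word.
                hits.append((pos, len(gram.split()), gram))
        hits.sort()
        return "|".join(gram for _, _, gram in hits) if hits else "-1"

For instance, with the set {"on the", "on the table"} and the document "the book is on the table", this sketch returns "on the|on the table": both matches start at the same word, so the shorter prefix comes first.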

Input to your program will be provided on the standard input, and the output must appear on the standard output.

Testing Protocol

Our test harness will first feed the initial set of N-grams to your program's standard input. Your program will receive multiple lines where each one contains a string which represents an N-gram. The initial set ends with a line containing the character 'S'.

After sending the initial set of strings, our test harness will monitor your program's standard output for a line containing the character 'R' (case insensitive), followed by the newline character '\n'. Your program uses this line to signal that it has finished ingesting the initial set of N-grams, has performed any processing and/or indexing on it, and is now ready to receive the workload.

The test harness delivers the workload in batches. Each batch consists of a sequence of operations provided one per line followed by a line containing the single character 'F' that signals the end of the batch. Each operation is represented by one character ('Q', 'A' or 'D') that defines the operation type, followed by a space and either an N-gram or a document.

The three operation types are as follows:

  • 'Q'/query: This operation must be answered with the N-grams found in the document. Your program must output one line for each document, containing all the extracted N-grams separated by '|'. If there are no extracted N-grams, your program should answer with -1.
  • 'A'/add: This operation requires you to modify your set of N-grams by adding a new one. If the specified N-gram already exists, the set should remain unchanged. This operation should not produce any output.
  • 'D'/delete: This operation requires you to modify your set of N-grams by removing an N-gram. If the specified N-gram does not exist, the set should remain unchanged. This operation should not produce any output.

After the end of every batch, our test harness will wait for output from your program before providing the next batch. You need to provide as many lines of output as there are query ('Q') operations in the batch, each line containing the ordered matched N-grams separated by '|'. Your program is free to process operations in a batch concurrently. However, the query results in the output must reflect the order of the queries within the batch.

After the last batch, the standard input stream of your program will be closed by our test harness.
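
Purely to illustrate the exchange described above (and not as an efficient solution), a minimal single-threaded Python skeleton of this protocol might look as follows. It reuses the hypothetical answer_query helper sketched earlier, and flushing after each batch matters because the harness waits for your output before sending the next batch:

    import sys

    def main():
        ngrams = set()

        # Ingest the initial N-gram set, terminated by a line containing 'S'.
        for line in sys.stdin:
            line = line.rstrip("\n")
            if line == "S":
                break
            ngrams.add(line)

        # Signal that indexing is done and the workload may start.
        sys.stdout.write("R\n")
        sys.stdout.flush()

        # Process the workload batch by batch until stdin is closed.
        answers = []
        for line in sys.stdin:
            line = line.rstrip("\n")
            if line == "F":
                # End of batch: emit one result line per query, in query order.
                sys.stdout.write("".join(a + "\n" for a in answers))
                sys.stdout.flush()
                answers = []
            elif line.startswith("Q "):
                answers.append(answer_query(ngrams, line[2:]))  # matcher sketched earlier
            elif line.startswith("A "):
                ngrams.add(line[2:])
            elif line.startswith("D "):
                ngrams.discard(line[2:])

    if __name__ == "__main__":
        main()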

Your solution will be evaluated for correctness and execution time. Execution time measurement does not start until your program signals (with 'R') that it is finished ingesting the initial set of strings. Thus, you are free to pre-process or index the N-grams as you see fit without penalty, as long as your program runs within the overall testing time limit. Concurrent request execution within each batch is allowed and encouraged, as long as the results mimic a sequential execution of the operations within the batch. In particular, the result for each query must reflect all additions and deletions that precede it in the workload sequence, and must not reflect any additions and deletions that follow it.
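
As one purely illustrative way to exploit this freedom, the queries of a batch could be evaluated concurrently against immutable snapshots of the N-gram set, while additions and deletions are applied in order; collecting the futures in submission order preserves the required sequential semantics. Note that in CPython a thread pool will not speed up CPU-bound matching because of the global interpreter lock, so the sketch below only shows the scheduling idea (process_batch is a hypothetical helper and answer_query is the matcher sketched earlier):

    from concurrent.futures import ThreadPoolExecutor

    def process_batch(ngrams, operations, pool):
        # operations: list of (op, payload) pairs for one batch, in arrival order.
        futures = []
        for op, payload in operations:
            if op == "Q":
                snapshot = frozenset(ngrams)      # state this query is allowed to see
                futures.append(pool.submit(answer_query, snapshot, payload))
            elif op == "A":
                ngrams.add(payload)
            elif op == "D":
                ngrams.discard(payload)
        # Gathering results in submission order matches the query order in the batch.
        return [f.result() for f in futures]

A real solution would avoid copying the whole set for every query (for example by versioning entries), but the per-query snapshot makes the ordering guarantee easy to see.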

Try-It-Yourself Example

We are providing you with this widget in order to test your understanding of the task and the testing protocol. It is JavaScript-based and runs locally in your browser. Note that it is not intended for very large input files, which might crash your browser.

To use this widget, simply change the input on the left and hit the Solve button. The expected output will be displayed on the right side. You are encouraged to use it to understand the exact expected ordering of the matched N-grams, especially when one matched N-gram is a prefix of another.

Reference Solution

We have implemented a simple reference solution in Python. The reference solution is in the format required for submission. You can create a submittable submission.tar.gz file using the included package.sh script.

Evaluation Process

Our testing infrastructure will evaluate each submission by unpacking the submission file, compiling the submitted code (using your submitted compile.sh script), and then running a series of tests (using your submitted run.sh script). Extraction time is limited to 10 seconds, and compilation time is limited to 60 seconds. Submissions that exceed these limits will not be tested at all.

Each test uses the test harness to supply an initial dataset and a workload to your submitted program. The total time for each test is limited to 4-5 minutes (different tests may have slightly different limits). The total per-test time includes the time to ingest the dataset and the time to process the queries and updates in the workload. This per-test execution time limit may be reduced later in the contest.

Submissions will be unpacked, compiled and tested in a network-isolated container with characteristics shown in the following table:

Processor:         2x Intel Xeon E5-2630 v4 @ 2.20GHz (3.10GHz turbo)
Configuration:     20 Cores / 40 Hyperthreads
Main Memory:       128 GB
Operating System:  Ubuntu 16.04.1 LTS
Software:          Autoconf 2.69, Automake 1.15, Boost 1.58, CMake 3.5.1, Go 1.7.4, Maven 3.3.9, Node.js 6.9.3, Oracle Java 8, Python 2.7.12 (as python), Python 3.5.2 (as python3), Ruby 2.3.0 (as ruby2.3), Rust 1.14.0, TBB 4.4, ant 1.9.6, clang 3.8, gcc/g++ 5.4.0, gccgo 6.0.0, jemalloc 3.6.0

A submission is considered to be rankable if it correctly processes all of the test workloads within their time limits. Rankable submissions will be scored according to the sum of their execution times on the test workloads. As discussed in the testing protocol description, initial import and indexing of strings is not included in a submission's score. The leaderboard will show the best rankable submission for each team that has at least one such submission.

There are currently three test workloads. Additional workloads may be added later in the contest.

Once submissions have closed, we will review them to choose five contest finalists. We will consider at least the top five submissions from the leaderboard as of the time that submissions close, and we may consider more than that. We will review each of those submissions to ensure that they adhere to the contest guidelines. Please make sure that your README file accurately describes both your submission and your team. We will also subject submissions to additional, unpublished tests, which will contribute to the final ranking. Our intention is to notify the five finalists by 31 March, 2017 at the latest.

Submitting

The SIGMOD 2017 programming contest management system accepts submissions consisting of a single compressed tar file called submission.tar.gz. Submission files must be no larger than 5 MB; larger files will be rejected. Each submission must include the following files/directories:
  • run.sh
    This script is responsible for running your executable, which should interact with our test harness according to the testing protocol, reading from standard input and writing results to standard output.
  • README
    This file must contain information about the submission, including:
    1. team name
    2. for each team member: full name, e-mail, institution, department, and degree program
    3. advisor/supervisor name (if any)
    4. a brief description of the solution
    5. a list of the third party code used and their licenses
  • src/
    This directory must contain all of the source code.
  • compile.sh
    This shell script must be able to compile the source contained in the src directory. The produced executable file(s) must be located within the src directory. The benchmark environment is isolated and has no internet access; you must provide all files required for successful compilation.

You can use the reference solution, which has the required format, as a starting point.

You are strongly encouraged to ensure that your solution is correct before submitting it. To this end, we have packaged our test harness and a test case for download. This dataset contains Wikipedia titles as N-grams and papers from arxiv.org as documents. You can use the test harness to test your solution locally before submitting it.

Submissions will be evaluated and results will appear on the leaderboard. Teams are allowed to make any number of submissions. However, all submissions must be made by 22 March 2017, 23:59 EET (GMT+2).

Rules

  • The 2017 SIGMOD Programming Contest is open to undergraduate and graduate students from degree-granting institutions all over the world. However, students associated with the organizers' institution (University of Athens) are not eligible to participate.
  • Teams must consist of individuals currently registered as graduate or undergraduate students at an accredited academic institution. A team may be formed by one or more students, who need not be enrolled at the same institution. Several teams from the same institution can compete independently, but one person can be a member of only one team. There is no limit on team size. Teams can register on the contest site after 6 February, 2017.
  • All submissions must consist only of code written by the team or open source licensed software (i.e., using an OSI-approved license). For source code from books or public articles, clear reference and attribution must be made. Final submissions must be made by 22 March 2017, 23:59 EET (GMT+2).
  • All teams must agree to license their code under an OSI-approved open source license. By participating in this contest, each team agrees to publish its source code. The finalists' implementations will be made public on the contest website.
  • A team is eligible for the prize if at least one of its members comes to present the work at the SIGMOD 2017 conference. The travel grants will only be granted to eligible teams.