    nlreqdataset-unl-enco

This repo contains the whole nlreqdataset of system requirements (http://fmt.isti.cnr.it/nlreqdataset/), enconverted into UNL with http://unl.ru/deco.html.

The dataset is presented in the following paper, whose abstract is reproduced below:

    PURE: a Dataset of Public Requirements Documents

    Ferrari, Alessio; Spagnolo, Giorgio Oronzo; Gnesi, Stefania

This paper presents PURE (PUblic REquirements dataset), a dataset of 79 publicly available natural language requirements documents collected from the Web. The dataset includes 34,268 sentences and can be used for natural language processing tasks that are typical in requirements engineering, such as model synthesis, abstraction identification and document structure assessment. It can be further annotated to work as a benchmark for other tasks, such as ambiguity detection, requirements categorisation and identification of equivalent requirements. In the associated paper, we present the dataset and we compare its language with generic English texts, showing the peculiarities of the requirements jargon, made of a restricted vocabulary of domain-specific acronyms and words, and long sentences. We also present the common XML format to which we have manually ported a subset of the documents, with the goal of facilitating replication of NLP experiments. The XML documents are also available for download.

The paper associated with the dataset can be found here:

    https://ieeexplore.ieee.org/document/8049173/

    More info about the dataset is available here:

    http://nlreqdataset.isti.cnr.it

A preprint of the paper is available on ResearchGate:

    https://goo.gl/HxJD7X

    Usage of the unlizeXml.py script to enconvert a document

The encoding script works on XML files conforming to ./data/orig/req_document.xsd.

Examples of inputs and outputs are provided in ./data/examples/.

Zipped folders of "unlized" XML files of the corpus are available in the ./data folder.

    ‼️ unlizeXml.py ignores namespaces in the XML document.
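The repository does not publish how unlizeXml.py does this, but ignoring namespaces typically amounts to stripping the `{uri}` prefix that XML parsers attach to namespaced tags, so elements can be matched by local name alone. A minimal sketch with Python's standard `xml.etree.ElementTree` (the element names are hypothetical, not taken from req_document.xsd):

```python
import xml.etree.ElementTree as ET

def strip_namespaces(root: ET.Element) -> ET.Element:
    """Remove the '{uri}' prefix from every tag and attribute name,
    so the tree can be queried by local names only."""
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.startswith("{"):
            el.tag = el.tag.split("}", 1)[1]
        # Attribute names can carry a namespace prefix too
        el.attrib = {
            (k.split("}", 1)[1] if k.startswith("{") else k): v
            for k, v in el.attrib.items()
        }
    return root

doc = ET.fromstring('<r xmlns="http://example.org"><req id="R1">text</req></r>')
root = strip_namespaces(doc)
print(root.find("req").get("id"))  # R1
```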

    First clone the repo (or at least download the scripts folder):

    git clone https://gitlab.tetras-libre.fr/unl/nlreqdataset-unl-enco.git

Then enter the scripts folder:

    cd nlreqdataset-unl-enco/scripts

The main Python 3 script for encoding is unlizeXml.py.

It relies on the included unlTools Java executable. You may want to update it with a newer version, if one is available at https://gitlab.tetras-libre.fr/unl/unlTools/-/releases.

Basic usage is:

    python unlizeXml.py <input-file-path> <output-file-path>

Further options are described by the --help flag:

    $ python unlizeXml.py --help
    Usage: unlizeXml.py [OPTIONS] INPUT OUTPUT
    
    Options:
      --lang [en|ru]
      --dry-run / --no-dry-run  if true do not send request to unl.ru
      --svg / --no-svg          Add svg node representing unl graph
      --unltools-path FILE      Path of the unltools jar
      --help                    Show this message and exit.

batch_unlizeXml.sh is an example script that encodes a batch of files in a folder (the folder name is hardcoded in the script).
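The same batching idea can be sketched in Python with the folder taken as a parameter instead of being hardcoded; the function name and the `-unl` output suffix below are assumptions, not the shell script's actual conventions:

```python
from pathlib import Path

def batch_targets(in_dir: str, out_dir: str):
    """Pair every XML file in `in_dir` with an output path in `out_dir`.
    Each (input, output) pair could then be passed to unlizeXml.py,
    e.g. via subprocess, one file at a time."""
    out = Path(out_dir)
    for xml in sorted(Path(in_dir).glob("*.xml")):
        yield xml, out / f"{xml.stem}-unl.xml"
```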

Usage of the unlizeToNotebook.py script to convert a document to a Jupyter notebook for UNL graph visualisation and post-editing

The script works on XML files conforming to ./data/orig/req_document.xsd or on enconverted files (outputs of the unlizeXml.py script). In the first case, it first enconverts the files using http://unl.ru/deco.

Examples of inputs and outputs are provided in ./data/examples/.

Zipped folders of "unlized" XML files of the corpus are (or will be) available in the ./data folder.

To convert a file, use a command like the following (it takes a moment because the notebook is executed before saving):

    python unlizeToNotebook.py input.xml output.ipynb

batch_unlizeToNotebook.sh is an example script that converts a batch of files in a folder (the folder name is hardcoded in the script).

By default, the ipynb notebook uses an unlTools jar executable to convert UNL texts to SVG graphs. You can download the jar from the release section of unlTools; it must be placed in the same directory as the notebook.
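Since a missing jar only surfaces as an error once the notebook runs, a quick presence check before launching can save a round trip. A small sketch (the jar file name is the one quoted from the notebook's conversion call; the helper itself is hypothetical):

```python
from pathlib import Path

# Jar file name as it appears in the notebook's unl2dot(...) call
JAR_NAME = "unl2rdf-app-1.0-SNAPSHOT-jar-with-dependencies.jar"

def jar_next_to(notebook_dir: str) -> bool:
    """True if the unlTools jar sits in the notebook's own directory,
    which is where the default local-jar conversion expects it."""
    return (Path(notebook_dir) / JAR_NAME).is_file()
```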

As an alternative, you can modify the second cell of the notebook so that it uses the unl2rdf web service instead of a local jar: replace svg = unl2dot(unl, "unl2rdf-app-1.0-SNAPSHOT-jar-with-dependencies.jar") with svg = unl2dotWeb(unl), as explained in the comments.