CM-Tool: Corpus Making Tool


    This repository gathers useful Python scripts to obtain experimental data and build corpora about various topics.

    Input Data

    The "inputData" directory contains source data, which are raw text data from different sources (such as DBPedia).

    Output Data

    The "outputData" directery contains produced data, including a sequence of sentences ('dataRef.sentence.txt'), and, for each sentence, some files corresponding to various representations such as:

    • Textual AMR graphs in PENMAN format ('dataRef.amr.penman')
    • Visual AMR graphs in DOT and PNG formats ('dataRef.amr.dot', 'dataRef.amr.png')
    • AMR Linked Data in N-Triples and Turtle formats ('dataRef.amr.nt', 'dataRef.amr.ttl')

    These data were obtained from the sources by applying the script 'convert_text_to_amr.py'. The parsing module used can be specified in the file name (e.g. '.stog' for the STOG model).
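
    As an illustration of how these formats relate, the following sketch (not the repository's own code) reads a PENMAN file from "outputData" and renders each graph as DOT/PNG; it assumes the 'penman' and 'graphviz' packages are available.

    import penman
    import graphviz

    # 'dataRef.amr.penman' may contain several graphs separated by blank lines.
    with open("outputData/dataRef.amr.penman", encoding="utf-8") as f:
        amr_graphs = penman.loads(f.read())

    for i, g in enumerate(amr_graphs):
        dot = graphviz.Digraph(format="png")
        for source, role, target in g.triples:
            if role == ":instance":
                dot.node(source, target)                   # concept node labelled by its instance
            else:
                dot.edge(source, str(target), label=role)  # relation or attribute edge
        dot.render(f"outputData/dataRef.amr.{i}")          # writes the DOT source and a .png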

    Script <convert_text_to_amr.py>

    This script converts raw texts into AMR representations. It can be adapted as needed; in particular, parameters can be adjusted to specify the data to be processed.
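
    At its core, the conversion relies on the amrlib library; a minimal sketch of the underlying calls (the actual script adds file handling and parameters) looks like this:

    import amrlib

    stog = amrlib.load_stog_model()        # uses the model installed as amrlib/data/model_stog
    sentences = ["Barack Obama was born in Hawaii."]
    graphs = stog.parse_sents(sentences)   # one PENMAN string per input sentence

    with open("outputData/dataRef.amr.penman", "w", encoding="utf-8") as f:
        f.write("\n\n".join(graphs))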

    Installation

    This project was developed under Python 3 on a Manjaro Linux system, but it should run on any common system.

    First, it is recommended to use a virtual environment. For example, 'ssc-env' can be created and used with the following commands:

    python3 -m venv ssc-env
    
    source ssc-env/bin/activate

    The necessary libraries are listed in the file 'requirements.txt'. They can be installed in the virtual environment using a package installer such as pip:

    pip install -r requirements.txt

    See the specific installation instructions for amrlib (amrlib-install).

    It is also necessary to install the models used by the amrlib library. Models can be downloaded from amrlib-models.

    These files need to be extracted into the install directory under amrlib/data and should be named model_stog for the default parse model. The default model is loaded with 'stog = amrlib.load_stog_model()'. To have multiple models of the same type, supply the directory name when loading, i.e. 'stog = amrlib.load_stog_model(model_dir='.../amrlib/data/model_parse_t5-v0_1_0')'.
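
    For example, assuming the T5 parse model named above has been extracted under the amrlib install directory, it can be located and loaded as follows (sketch only; adjust the directory name to the model you actually downloaded):

    import os
    import amrlib

    # Locate the data directory of the installed amrlib package.
    amrlib_data = os.path.join(os.path.dirname(amrlib.__file__), "data")
    model_dir = os.path.join(amrlib_data, "model_parse_t5-v0_1_0")  # illustrative directory name

    stog = amrlib.load_stog_model(model_dir=model_dir)
    print(stog.parse_sents(["The model was loaded from an explicit directory."])[0])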

    Usage

    Parameters can be adjusted in the script code as needed. The script takes a data reference as argument (for example, 'test'). It can be run from the command line:

    python3 convert_text_to_amr.py test
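
    The data reference simply selects which files are read and written; a hypothetical sketch of this calling convention (the real script may build its file names differently) is:

    import sys

    # The first command-line argument is the data reference, e.g. 'test'.
    data_ref = sys.argv[1] if len(sys.argv) > 1 else "test"

    sentence_file = f"outputData/{data_ref}.sentence.txt"   # assumed naming scheme
    penman_file = f"outputData/{data_ref}.amr.penman"
    print(f"Would convert the sentences of {sentence_file} into {penman_file}")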

    Library

    The "lib" directory contains useful libraries.

    References


    amrlib: A python library that makes AMR parsing, generation and visualization simple.

    amr-ld: A Python library for mapping AMRs to linked data formats (such as RDF and JSON-LD).

    Burns, G.A., Hermjakob, U., Ambite, J.L. (2016). Abstract Meaning Representations as Linked Data. In: The Semantic Web – ISWC 2016. Lecture Notes in Computer Science, vol. 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_2

    DBPedia: A crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.