
CM-Tool: Corpus Making Tool


This repository gathers useful programs for obtaining experimental data and building corpora on various topics.

Source

The "source" directory contains source data, which are raw text data from DBPedia.

Data

The "data" directory contains data in different representations:

  • sequence of sentences ('dataRef.sentence.txt')
  • AMR graphs ('dataRef.amr.graph')
  • AMR Linked Data ('dataRef.amr.rdf')

These data were obtained from the sources by applying the script 'convert_text_to_amr.py'.
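
As an illustration of the naming scheme above, the following minimal sketch builds the three file paths for a given data reference. It assumes that 'dataRef' in the file names stands for the reference passed to the script (for example, 'test') and that the files sit directly in the "data" directory. The function name `data_paths` is hypothetical and is not part of this repository.

```python
from pathlib import Path

# Suffixes of the three representations listed above.
SUFFIXES = {
    "sentences": ".sentence.txt",
    "amr_graph": ".amr.graph",
    "amr_rdf": ".amr.rdf",
}


def data_paths(data_ref, data_dir="data"):
    """Return the expected path of each representation for one data reference.

    Assumption: 'dataRef' in the file names is replaced by the data reference.
    """
    base = Path(data_dir)
    return {kind: base / f"{data_ref}{suffix}" for kind, suffix in SUFFIXES.items()}
```

With these assumptions, `data_paths("test")["sentences"]` points to 'data/test.sentence.txt'.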

Script <convert_text_to_amr.py>

This script converts raw texts into AMR representations. It can be adapted as needed; in particular, parameters can be adjusted to specify the data to be processed.

Installation

This project was developed with Python 3 on a Manjaro Linux system, but it should run on any common system.

First, it is recommended to use a virtual environment. For example, a virtual environment named 'ssc-env' can be created and activated with the following commands:

python3 -m venv ssc-env

source ssc-env/bin/activate

The required libraries are listed in the file 'requirements.txt'. They can be installed in the virtual environment with a package installer such as pip:

pip install -r requirements.txt

See the installation instructions specific to amrlib (amrlib-install).

In particular, the models used by the amrlib library must be installed. Models can be downloaded from amrlib-models.

These files need to be extracted and placed in the install directory under amrlib/data; the directory for the default parse model should be named model_stog. The default model is loaded with 'stog = amrlib.load_stog_model()'. To have multiple models of the same type, you need to supply the directory name when loading, i.e. 'stog = amrlib.load_stog_model(model_dir='.../amrlib/data/model_parse_t5-v0_1_0')'.
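
The two loading modes described above can be sketched as follows. This is a minimal illustration, not the code of 'convert_text_to_amr.py': the wrapper name `parse_sentences` is hypothetical, and the sketch assumes that amrlib and at least one parse model are installed as described. It relies only on `amrlib.load_stog_model` and on the `parse_sents` method of the loaded model.

```python
def parse_sentences(sentences, model_dir=None):
    """Parse sentences into AMR graph strings with amrlib.

    With model_dir=None the default model (amrlib/data/model_stog) is loaded;
    otherwise the model stored in the given directory is loaded.
    """
    # Imported here so that this module can be loaded without amrlib installed.
    import amrlib

    if model_dir is None:
        stog = amrlib.load_stog_model()
    else:
        stog = amrlib.load_stog_model(model_dir=model_dir)

    # parse_sents takes a list of sentences and returns one graph per sentence.
    return stog.parse_sents(list(sentences))
```

For example, `parse_sentences(['The boy wants to go.'])` would return a list containing one AMR graph string, provided the default model is available.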

Usage

Parameters can be adjusted in the script code as needed. The script takes a data reference as its argument (for example, 'test'). It can be run from the command line:

python3 convert_text_to_amr.py test
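
For readers adapting the script, a single positional argument such as the data reference can be read with argparse, as in the sketch below. The actual argument handling of 'convert_text_to_amr.py' is not shown here, so the names `parse_args` and `data_ref` are assumptions made for illustration.

```python
import argparse


def parse_args(argv=None):
    """Read the data reference from the command line (hypothetical sketch)."""
    parser = argparse.ArgumentParser(
        description="Convert raw text into AMR representations."
    )
    # Single positional argument: the reference of the data to process.
    parser.add_argument("data_ref", help="data reference, for example 'test'")
    # argv=None makes argparse read the arguments from sys.argv.
    return parser.parse_args(argv)
```

Called as `parse_args(["test"])`, the returned namespace holds the reference in its `data_ref` attribute.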

Library

The "lib" directory contains useful libraries.

References


amrlib: A python library that makes AMR parsing, generation and visualization simple.

amr-ld: A Python library for mapping AMRs to linked data formats (such as RDF and JSON-LD).

Burns, G.A., Hermjakob, U., Ambite, J.L. (2016). Abstract Meaning Representations as Linked Data. In: The Semantic Web – ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_2

DBPedia: A crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.