CM-Tool: Corpus Making Tool
This repository gathers programs for obtaining experimental data and building corpora on various topics.
Source
The "source" directory contains the source data: raw text from DBPedia.
Data
The "data" directory contains data in different representations:
- sequence of sentences ('dataRef.sentence.txt')
- AMR graphs ('dataRef.amr.graph')
- AMR Linked Data ('dataRef.amr.rdf')
These data were obtained from the source files by applying the script 'convert_text_to_amr.py'.
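As an illustration of this naming scheme, the sketch below derives the three file names for a given data reference. It assumes that the 'dataRef' prefix in the file names stands for the data reference passed to the script (for example, 'test'). The helper name output_paths and the SUFFIXES table are hypothetical and are not part of this repository.

```python
from pathlib import Path

# Suffixes taken from the file names listed in the "data" directory description.
SUFFIXES = {
    "sentences": ".sentence.txt",
    "amr_graph": ".amr.graph",
    "amr_rdf": ".amr.rdf",
}


def output_paths(data_ref, data_dir="data"):
    """Return the expected file path of each representation of data_ref.

    Assumption: the file name prefix is the data reference itself.
    """
    base = Path(data_dir)
    return {kind: base / f"{data_ref}{suffix}" for kind, suffix in SUFFIXES.items()}
```

For example, output_paths("test")["amr_rdf"] gives a path ending in 'test.amr.rdf' inside the "data" directory.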
Script <convert_text_to_amr.py>
This script converts raw text into AMR representations. It can be adapted as needed; in particular, its parameters can be adjusted to specify the data to be processed.
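The text-to-AMR step could look roughly like the sketch below. It is not the code of 'convert_text_to_amr.py'. The calls amrlib.load_stog_model and parse_sents come from amrlib, while the function names and the one-sentence-per-line splitting are assumptions made for illustration. Running sentences_to_amr requires amrlib and a parse model to be installed.

```python
def read_sentences(text):
    """Split raw text into non-empty, stripped lines.

    Assumption: the input holds one sentence per line.
    """
    return [line.strip() for line in text.splitlines() if line.strip()]


def sentences_to_amr(sentences):
    """Parse sentences into AMR graph strings with the default amrlib model."""
    # Imported lazily so that read_sentences stays usable without amrlib.
    import amrlib

    stog = amrlib.load_stog_model()
    return stog.parse_sents(sentences)
```

Calling sentences_to_amr(read_sentences(raw_text)) returns one AMR graph string per input sentence.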
Installation
This project was developed with Python 3 on Manjaro Linux, but it should run on any common system.
First, it is recommended to use a virtual environment. For example, 'ssc-env' can be created and activated with the following commands:
python3 -m venv ssc-env
source ssc-env/bin/activate
The necessary libraries are listed in the file 'requirements.txt'. They can be installed in the virtual environment using a package installer such as pip:
pip install -r requirements.txt
See the specific installation instructions for amrlib (amrlib-install).
It is also necessary to install the models used by the amrlib library. Models can be downloaded from amrlib-models.
These files need to be extracted into the install directory under amrlib/data, and the default parse model should be named model_stog. The default model is loaded with 'stog = amrlib.load_stog_model()'. To keep multiple models of the same type, supply the directory name when loading, i.e. 'stog = amrlib.load_stog_model(model_dir='.../amrlib/data/model_parse_t5-v0_1_0')'.
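To check which models are installed before loading one, a small helper along the following lines may be useful. This is a sketch only: find_model_dirs and load_parse_model are hypothetical names, and the "model_" prefix filter is an assumption based on the model directory names mentioned here. The location of the amrlib/data directory depends on the installation and must be supplied by the caller.

```python
from pathlib import Path


def find_model_dirs(data_dir):
    """Return the sorted names of sub-directories of data_dir starting with 'model_'."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.name for p in root.iterdir() if p.is_dir() and p.name.startswith("model_")
    )


def load_parse_model(model_dir=None):
    """Load the default parse model, or the one stored in model_dir if given."""
    # Imported lazily: requires amrlib to be installed.
    import amrlib

    if model_dir is None:
        return amrlib.load_stog_model()
    return amrlib.load_stog_model(model_dir=str(model_dir))
```

If find_model_dirs returns an empty list for the amrlib/data directory, no model has been extracted there yet and the load call is expected to fail.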
Usage
Parameters can be adjusted in the script code as needed. The script takes a data reference as its argument (for example, 'test'). It can be run from the command line:
python3 convert_text_to_amr.py test
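For readers adapting the script, one possible way to read the data reference from the command line is sketched below with argparse. This is an assumption about how such an argument could be handled, not a description of the actual script. The names parse_args and data_ref are hypothetical.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; argv=None means the process arguments are used."""
    parser = argparse.ArgumentParser(
        description="Convert raw text to AMR representations (illustrative sketch)."
    )
    parser.add_argument(
        "data_ref",
        help="reference of the data to process, for example 'test'",
    )
    return parser.parse_args(argv)
```

With this sketch, parse_args(["test"]).data_ref evaluates to the string 'test'.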
Library
The "lib" directory contains useful library code.
References
amrlib: A Python library that makes AMR parsing, generation and visualization simple.
amr-ld: A Python library for mapping AMRs to linked data formats (such as RDF and JSON-LD).
Burns, G.A., Hermjakob, U., Ambite, J.L. (2016). Abstract Meaning Representations as Linked Data. In: The Semantic Web – ISWC 2016. Lecture Notes in Computer Science, vol 9982. Springer, Cham. https://doi.org/10.1007/978-3-319-46547-0_2
DBPedia: A crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects.