Commit 319dc4ef authored by Aurélien Lamercerie

Clean and add data with turtle format (.ttl)

parent a2cc346e
Showing with 3018 additions and 3141 deletions
# Solar System Corpus
-------------------------------------------------------------------------------
This repository gathers experimental data about the solar system.
## Source
The "source" directory contains source data, which are raw text data
from [DBPedia](https://dbpedia.org/):
- test: some simple sentences for testing
- solar-system: English abstract from [https://dbpedia.org/page/Solar_System](https://dbpedia.org/page/Solar_System)
## Data
The "data" directory contains data in different representations:
- sequence of sentences ('dataRef.sentence.txt')
- AMR graphs ('dataRef.amr.graph')
- AMR Linked Data ('dataRef.amr.rdf')
These data were obtained from the sources using the
[cm-tool](https://gitlab.tetras-libre.fr/tetras-mars/corpus/cm-tool) project.
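For illustration, an entry in a 'dataRef.amr.graph' file pairs an '# ::id' comment with a graph in PENMAN notation. For the test sentence "The sun is a star.", an entry would look roughly like the sketch below (illustrative only, not exact model output):

```
# ::id test-1
# ::snt The sun is a star.
(s / star
      :domain (s2 / sun))
```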
## Script <convert_text_to_amr.py>
This script converts raw texts into AMR representations. It can be
adapted as needed; in particular, parameters can be adjusted to specify the
data to be processed.
### Installation
This project was developed under Python 3 on a Manjaro Linux system, but it
should run on any common system.
First, it is recommended to use a
[virtual environment](https://docs.python.org/fr/3/tutorial/venv.html).
For example, 'ssc-env' can be created and used with the following commands:
python3 -m venv ssc-env
source ssc-env/bin/activate
The necessary libraries are listed in the file 'requirements.txt'.
They can be installed in the virtual environment using a package installer
such as pip:
pip install -r requirements.txt
See the specific installation instructions for amrlib
([amrlib-install](https://amrlib.readthedocs.io/en/latest/install/)).
It is also necessary to install the models used by the amrlib library.
Models can be downloaded from
[amrlib-models](https://github.com/bjascob/amrlib-models).
These files need to be extracted into the install directory under
amrlib/data, and the default parse model should be named model_stog.
The default model is loaded with `stog = amrlib.load_stog_model()`.
To keep multiple models of the same type, supply the directory name
when loading, i.e.
`stog = amrlib.load_stog_model(model_dir='.../amrlib/data/model_parse_t5-v0_1_0')`.
### Usage
Parameters can be adjusted in the script code as needed. The script takes a
data reference as argument (for example, 'test'). It can be run from the
command line:
python3 convert_text_to_amr.py test
## Library
The "lib" directory contains useful libraries.
# References
-------------------------------------------------------------------------------
[amrlib](https://github.com/bjascob/amrlib):
A python library that makes AMR parsing, generation and visualization simple.
[amr-ld](https://github.com/BMKEG/amr-ld/):
A Python library for mapping AMRs to linked data formats (such as RDF and JSON-LD).
Burns, G.A., Hermjakob, U., Ambite, J.L. (2016).
Abstract Meaning Representations as Linked Data.
In: The Semantic Web – ISWC 2016.
Lecture Notes in Computer Science, vol 9982. Springer, Cham.
https://doi.org/10.1007/978-3-319-46547-0_2
[DBPedia](https://www.dbpedia.org/about/):
A crowd-sourced community effort to extract structured content from the information
created in various Wikimedia projects.
#!/usr/bin/python3.10
# -*- coding: utf-8 -*-
#==============================================================================
# Solar System Corpus: convert text to amr
#------------------------------------------------------------------------------
# Script to convert raw text into AMR graphs
#==============================================================================
#==============================================================================
# Importing required modules
#==============================================================================
import amrlib
import re
import os
import sys
import subprocess
import shutil
#==============================================================================
# Parameters
#==============================================================================
# Input/Output Directories
INPUT_DIR = "source/"
OUTPUT_DIR = "data/"
# Reference Suffix
TEXT_SUFFIX = ".txt"
SENTENCE_SUFFIX = ".sentence.txt"
AMR_GRAPH_SUFFIX = ".amr.graph"
AMR_RDF_SUFFIX = ".amr.rdf"
# AMR Lib Models
AMR_LIB_DATA = '/home/lamenji/.local/lib/python3.10/site-packages/amrlib/data/'
AMR_MODEL_XFM_LARGE = AMR_LIB_DATA + 'model_parse_xfm_bart_large-v0_1_0'
amr_model = AMR_MODEL_XFM_LARGE
# AMRLD Parameters
AMRLD_DIR = 'lib/amrld/'
AMRLD_DIR_BACK = '../../'
WK_DIR = 'wk/'
AMRLD_WORKDIR = AMRLD_DIR + WK_DIR
#==============================================================================
# Functions to define filepath
#==============================================================================
def get_text_input_filepath(data_ref):
    return INPUT_DIR + data_ref + TEXT_SUFFIX

def get_text_output_filepath(data_ref):
    return OUTPUT_DIR + data_ref + TEXT_SUFFIX

def get_sentence_output_filepath(data_ref):
    return OUTPUT_DIR + data_ref + SENTENCE_SUFFIX

def get_amr_graph_output_filepath(data_ref):
    return OUTPUT_DIR + data_ref + AMR_GRAPH_SUFFIX

def get_amr_rdf_output_filepath(data_ref):
    return OUTPUT_DIR + data_ref + AMR_RDF_SUFFIX

def get_amr_graph_amrld_filepath(data_ref):
    return AMRLD_WORKDIR + data_ref + AMR_GRAPH_SUFFIX

def get_amr_rdf_amrld_filepath(data_ref):
    return AMRLD_WORKDIR + data_ref + AMR_RDF_SUFFIX

def get_amr_graph_wk_filepath(data_ref):
    return WK_DIR + data_ref + AMR_GRAPH_SUFFIX

def get_amr_rdf_wk_filepath(data_ref):
    return WK_DIR + data_ref + AMR_RDF_SUFFIX
#==============================================================================
# Utilities
#==============================================================================
def is_valid_sentence(sentence):
    """ True if the sentence is valid (non-empty and not a language mark). """
    is_empty = ((sentence == "") | (sentence == "\n"))
    lang_mark_re = re.compile(r"\([a-z]+\)(.)*")
    is_language_mark = lang_mark_re.match(sentence) is not None
    return not (is_empty | is_language_mark)

def clean_sentence(sentence):
    """ Sentence cleanup as needed """
    sentence = re.sub(r"(\.)*\n", "", sentence)
    return sentence

def insert_id_line(id_num, data_ref, writing_file):
    id_num += 1
    id_line_str = "# ::id " + data_ref + "-" + str(id_num) + "\n"
    writing_file.write(id_line_str)
    return id_num
#==============================================================================
# Main Functions
#==============================================================================
def prepare_data_sentences(data_ref):
    """ Prepare data before parsing """
    print("-- Reading input files to recover a list of sentences")
    sentences_list = list()
    input_file = get_text_output_filepath(data_ref)
    with open(input_file, "r") as reading_file:  # r = read
        for line in reading_file.readlines():
            sentences = line.split(". ")
            for sentence in sentences:
                if is_valid_sentence(sentence):
                    sentence = clean_sentence(sentence)
                    sentences_list.append(sentence + ".")
    print("----- number of sentences: " + str(len(sentences_list)))
    output_file = get_sentence_output_filepath(data_ref)
    print("-- Generating sentence file: " + output_file)
    with open(output_file, "w") as writing_file:  # w = write
        first = True
        for s in sentences_list:
            if not first: writing_file.write("\n")
            writing_file.write(s)
            first = False
    return sentences_list
def convert_sentences_to_graph(model, sentences):
    """ Converting text sentences to AMR graphs """
    print("-- Loading AMR model")
    stog = amrlib.load_stog_model(model_dir=model)
    print("-- Converting sentences to AMR graphs")
    graphs_list = []
    for sentence in sentences:
        graphs_list.extend(stog.parse_sents([sentence]))
    print("----- number of graphs: " + str(len(graphs_list)))
    return graphs_list

def write_amr_graph_output_file(graphs, data_ref):
    """ Writing AMR graphs to output files """
    output_file = get_amr_graph_output_filepath(data_ref)
    print("-- Generating AMR Graph file: " + output_file)
    with open(output_file, "w") as writing_file:  # w = write
        id_num = 0
        for graph in graphs:
            out_graph = graph + "\n\n"
            id_num = insert_id_line(id_num, data_ref, writing_file)
            writing_file.write(out_graph)
def convert_graphs_to_rdf(data_ref):
    """ Converting AMR graphs to AMR RDF """
    # -- Filepath
    input_file = get_amr_graph_output_filepath(data_ref)
    input_amrld_file = get_amr_graph_amrld_filepath(data_ref)
    output_amrld_file = get_amr_rdf_amrld_filepath(data_ref)
    input_wk_file = get_amr_graph_wk_filepath(data_ref)
    output_wk_file = get_amr_rdf_wk_filepath(data_ref)
    output_file = get_amr_rdf_output_filepath(data_ref)
    # -- AMR-LD processing
    amrld_process = ["python3", "amr_to_rdf.py",
                     "-i", input_wk_file,
                     "-o", output_wk_file]
    if os.path.isfile(input_file):
        print("-- Converting AMR graphs to RDF using amr-ld library")
        shutil.copyfile(input_file, input_amrld_file)
        os.chdir(AMRLD_DIR)
        subprocess.run(amrld_process)
        os.chdir(AMRLD_DIR_BACK)
    # -- Copy result
    if os.path.isfile(output_amrld_file):
        print("-- Generating AMR RDF file: " + output_file)
        shutil.copyfile(output_amrld_file, output_file)
#==============================================================================
# Main function
#==============================================================================
def main(data_ref):
    # -- Prepare the sentences to be converted
    print("\n" + "[SSC] Data Preparation")
    print("-- input data reference: " + data_ref)
    source = get_text_input_filepath(data_ref)
    destination = get_text_output_filepath(data_ref)
    shutil.copyfile(source, destination)
    sentences = prepare_data_sentences(data_ref)
    # -- Convert sentences to graphs
    print("\n" + "[SSC] Convert sentences to graphs")
    print("-- library: amrlib")
    print("-- model: " + amr_model)
    graphs = convert_sentences_to_graph(amr_model, sentences)
    write_amr_graph_output_file(graphs, data_ref)
    # -- Convert graphs to RDF
    print("\n" + "[SSC] Convert AMR graphs to AMR RDF")
    print("-- library: amr-ld")
    convert_graphs_to_rdf(data_ref)
    # -- Ending print
    print("\n" + "[SSC] Done")

if __name__ == "__main__":
    main(sys.argv[1])
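The sentence-preparation step above can be exercised on its own; the following self-contained sketch reproduces the same splitting and cleaning logic (standard library only, independent of amrlib):

```python
import re

def is_valid_sentence(sentence):
    # Reject empty lines and language marks such as "(en)" found in DBPedia text
    is_empty = sentence in ("", "\n")
    is_language_mark = re.match(r"\([a-z]+\)(.)*", sentence) is not None
    return not (is_empty or is_language_mark)

def clean_sentence(sentence):
    # Drop a trailing period/newline left over from the split on ". "
    return re.sub(r"(\.)*\n", "", sentence)

line = "(en) The sun is a star. Earth is a planet.\n"
sentences = [clean_sentence(s) + "."
             for s in line.split(". ")
             if is_valid_sentence(s)]
print(sentences)  # → ['Earth is a planet.']
```

As in the script, the language-mark fragment is filtered out and each retained fragment ends with exactly one period.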
<http://amr.isi.edu/rdf/core-amr#Concept> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://amr.isi.edu/rdf/core-amr#Concept> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-Concept" .
<http://amr.isi.edu/rdf/core-amr#Role> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2000/01/rdf-schema#Class> .
<http://amr.isi.edu/rdf/core-amr#Role> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-Role" .
<http://amr.isi.edu/rdf/core-amr#NamedEntity> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#Concept> .
<http://amr.isi.edu/rdf/core-amr#NamedEntity> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-EntityType" .
<http://amr.isi.edu/rdf/core-amr#NamedEntity> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-Term" .
<http://amr.isi.edu/rdf/core-amr#Frame> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#Concept> .
<http://amr.isi.edu/rdf/core-amr#Frame> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-PropBank-Frame" .
<http://amr.isi.edu/frames/ld/v1.2.2/FrameRole> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#Role> .
<http://amr.isi.edu/frames/ld/v1.2.2/FrameRole> <http://www.w3.org/2000/01/rdf-schema#label> "AMR-PropBank-Role" .
<http://amr.isi.edu/rdf/amr-terms#domain> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#Role> .
<http://amr.isi.edu/rdf/amr-terms#sun> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#Concept> .
<http://amr.isi.edu/entity-types#star> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#NamedEntity> .
<http://amr.isi.edu/entity-types#planet> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#NamedEntity> .
<http://amr.isi.edu/amr_data/test-1#root01> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#AMR> .
<http://amr.isi.edu/amr_data/test-1#root01> <http://amr.isi.edu/rdf/core-amr#has-id> "test-1" .
<http://amr.isi.edu/amr_data/test-1#root01> <http://amr.isi.edu/rdf/core-amr#has-sentence> "The sun is a star." .
<http://amr.isi.edu/amr_data/test-1#root01> <http://amr.isi.edu/rdf/core-amr#root> <http://amr.isi.edu/amr_data/test-1#s> .
<http://amr.isi.edu/amr_data/test-1#s> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/entity-types#star> .
<http://amr.isi.edu/amr_data/test-1#s> <http://amr.isi.edu/rdf/amr-terms#domain> <http://amr.isi.edu/amr_data/test-1#s2> .
<http://amr.isi.edu/amr_data/test-1#s2> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/amr-terms#sun> .
<http://amr.isi.edu/amr_data/test-2#root01> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/rdf/core-amr#AMR> .
<http://amr.isi.edu/amr_data/test-2#root01> <http://amr.isi.edu/rdf/core-amr#has-id> "test-2" .
<http://amr.isi.edu/amr_data/test-2#root01> <http://amr.isi.edu/rdf/core-amr#has-sentence> "Earth is a planet." .
<http://amr.isi.edu/amr_data/test-2#root01> <http://amr.isi.edu/rdf/core-amr#root> <http://amr.isi.edu/amr_data/test-2#p> .
<http://amr.isi.edu/amr_data/test-2#p> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://amr.isi.edu/entity-types#planet> .
<http://amr.isi.edu/amr_data/test-2#p> <http://www.w3.org/2000/01/rdf-schema#label> "Earth" .
@prefix ns1: <http://amr.isi.edu/rdf/core-amr#> .
@prefix ns2: <http://amr.isi.edu/rdf/amr-terms#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
ns1:Concept a rdfs:Class ;
    rdfs:label "AMR-Concept" .

ns1:Role a rdfs:Class ;
    rdfs:label "AMR-Role" .

<http://amr.isi.edu/amr_data/test-1#root01> a ns1:AMR ;
    ns1:has-id "test-1" ;
    ns1:has-sentence "The sun is a star." ;
    ns1:root <http://amr.isi.edu/amr_data/test-1#s> .

<http://amr.isi.edu/amr_data/test-2#root01> a ns1:AMR ;
    ns1:has-id "test-2" ;
    ns1:has-sentence "Earth is a planet." ;
    ns1:root <http://amr.isi.edu/amr_data/test-2#p> .

<http://amr.isi.edu/frames/ld/v1.2.2/FrameRole> a ns1:Role ;
    rdfs:label "AMR-PropBank-Role" .

ns2:domain a ns1:Role .

ns1:Frame a ns1:Concept ;
    rdfs:label "AMR-PropBank-Frame" .

<http://amr.isi.edu/amr_data/test-1#s> a <http://amr.isi.edu/entity-types#star> ;
    ns2:domain <http://amr.isi.edu/amr_data/test-1#s2> .

<http://amr.isi.edu/amr_data/test-1#s2> a ns2:sun .

<http://amr.isi.edu/amr_data/test-2#p> a <http://amr.isi.edu/entity-types#planet> ;
    rdfs:label "Earth" .

<http://amr.isi.edu/entity-types#planet> a ns1:NamedEntity .

<http://amr.isi.edu/entity-types#star> a ns1:NamedEntity .

ns2:sun a ns1:Concept .

ns1:NamedEntity a ns1:Concept ;
    rdfs:label "AMR-EntityType" ,
        "AMR-Term" .
.project
.pydevproject
.history
.settings
*.pyc
out.rdf
.DS_Store
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "{}"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright {yyyy} {name of copyright owner}
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
# AMR-LD (AMRs as Linked Data)
## Advantages of AMR-LD
- Sharable in standard W3C format
- Some reasoning for free (using RDF/OWL tools):
- amr:ARG0-of owl:inverseOf amr:ARG0
- inheritance reasoning:
```
If :p rdf:type amr-ne:enzyme .
amr-ne:enzyme rdfs:subClassOf amr-ne:protein .
Then :p rdf:type amr-ne:protein
```
- Linked to well-known identifiers/entities
```
:e amr:xref up:RASH_HUMAN .
:e amr:xref pfam:PF00071 .
```
- Making semantic assertions of precise equivalence using `owl:sameAs` relations
```
:e owl:sameAs up:RASH_HUMAN .
```
- Query tools for free (SPARQL)
- Names of proteins in an AMR repository
```
select ?n
where { ?p rdf:type amr-ne:protein .
?p amr:name ?no .
?no rdf:type amr-ne:name .
?no amr:op1 ?n }
```
- All propbank events (verbs) that have a protein as :ARG1
```
select ?v
where { ?e rdf:type ?v .
?e amr:ARG1 ?p .
?p rdf:type amr-ne:protein .
### we would get all the "amr-ne:enzyme"s if reasoning enabled
}
```
- All propbank events (verbs) that have a protein as :ARG*
```
select ?v
where { ?e rdf:type ?v .
?e ?arg ?p .
?p rdf:type amr-ne:protein .
### we would get all the "amr-ne:enzyme"s if reasoning enabled
FILTER regex(str(?arg), "ARG", "i") }
```
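The inheritance reasoning sketched above is just a fixpoint closure over `rdfs:subClassOf`. As a minimal pure-Python illustration (not using an RDF library; prefixed names kept as plain strings):

```python
# Close a set of triples under rdfs:subClassOf type inheritance.
# Triples are plain 3-tuples of prefixed-name strings.
def infer_types(triples):
    """Return the triple set extended with inherited rdf:type triples."""
    closed = set(triples)
    changed = True
    while changed:
        changed = False
        for (s, p, o) in list(closed):
            if p != "rdf:type":
                continue
            # If s is typed o, and o is a subclass of c, then s is also typed c.
            for (o2, p2, c) in list(closed):
                if p2 == "rdfs:subClassOf" and o2 == o:
                    if (s, "rdf:type", c) not in closed:
                        closed.add((s, "rdf:type", c))
                        changed = True
    return closed

triples = {
    (":p", "rdf:type", "amr-ne:enzyme"),
    ("amr-ne:enzyme", "rdfs:subClassOf", "amr-ne:protein"),
}
# infer_types(triples) also contains (":p", "rdf:type", "amr-ne:protein")
```

In practice an OWL/RDFS reasoner attached to a triple store performs this closure (and the `owl:inverseOf` and `owl:sameAs` entailments) automatically.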
## AMR-LD example
The cjconsensus Gold-Standard data set contains the following AMR (from a sentence in the results section of [Innocenti et al. 2002](http://www.ncbi.nlm.nih.gov/pubmed/11777939)):
```
# ::id pmid_1177_7939.53 ::date 2015-03-07T10:57:15 ::annotator SDL-AMR-09 ::preferred
# ::snt Sos-1 has been shown to be part of a signaling complex with Grb2, which mediates the activation of Ras upon RTK stimulation.
# ::save-date Fri Mar 13, 2015 ::file pmid_1177_7939_53.txt
(s / show-01
:ARG1 (h / have-part-91
:ARG1 (m / macro-molecular-complex
:ARG0-of (s2 / signal-07)
:part (p2 / protein :name (n2 / name :op1 "Grb2")
:ARG0-of (m2 / mediate-01
:ARG1 (a / activate-01
:ARG1 (e / enzyme :name (n3 / name :op1 "Ras"))
:condition (s3 / stimulate-01
:ARG1 (e2 / enzyme :name (n4 / name :op1 "RTK")))))))
:ARG2 (p / protein :name (n / name :op1 "Sos-1"))))
```
Under our current rubric, this would then be translated to the following RDF TTL code, with a fairly simple one-to-one mapping from AMR elements to RDF elements.
```
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>.
@prefix xsd: <http://www.w3.org/2001/XMLSchema#>.
@prefix amr: <http://amr.isi.edu#> .
@prefix pb: <https://verbs.colorado.edu/propbank#> .
@prefix ontonotes: <https://catalog.ldc.upenn.edu/LDC2013T19#> .
@prefix amr-ne: <http://amr.isi.edu/entity-types#> .
@prefix up: <http://www.uniprot.org/uniprot/> .
@prefix pfam: <http://pfam.xfam.org/family/> .
### : is the default namespace
:a1 rdf:type amr:AMR .
:a1 amr:has-sentence "Sos-1 has been shown to be part of a signaling complex with Grb2, which mediates the activation of Ras upon RTK stimulation." .
:a1 amr:has-id "pmid_1177_7939.53" .
:a1 amr:has-date "2015-03-07T10:57:15" .
:a1 amr:has-annotator "SDL-AMR-09" .
:a1 amr:is-preferred "true"^^xsd:boolean .
:a1 amr:has-file "pmid_1177_7939_53.txt" .
:a1 amr:has-root :s . ### or :a1 amr:has-root :pmid_1177_7939.53__s
:s rdf:type pb:show-01 .
:s amr:ARG1 :h .
:h rdf:type pb:have-part-91 .
:h amr:ARG1 :m .
:m rdf:type amr-ne:macro-molecular-complex .
:m amr:ARG0-of :s2 .
:s2 rdf:type pb:signal-07 .
:m amr:part :p2 .
:p2 rdf:type amr-ne:protein .
:p2 amr:name :n2 .
:n2 rdf:type amr-ne:name .
:n2 amr:op1 "Grb2" .
:p2 amr:xref up:P62993 .
:p2 amr:xref up:GRB2_HUMAN .
:p2 amr:ARG0-of :m2 .
:m2 rdf:type pb:mediate-01 .
:m2 amr:ARG1 :a .
:a rdf:type pb:activate-01 .
:a amr:ARG1 :e .
:e rdf:type amr-ne:enzyme .
:e amr:name :n3 .
:n3 rdf:type amr-ne:name .
:n3 amr:op1 "Ras" .
:e amr:xref up:RASH_HUMAN . ### we could also do: :e owl:sameAs up:RASH_HUMAN
:e amr:xref pfam:PF00071 .
:a amr:condition :s3 .
:s3 rdf:type pb:stimulate-01 .
:s3 amr:ARG1 :e2 .
:e2 rdf:type amr-ne:enzyme .
:e2 amr:name :n4 .
:n4 rdf:type amr-ne:name .
:n4 amr:op1 "RTK" .
:e2 amr:xref pfam:PF07714 .
:h amr:ARG2 :p .
:p rdf:type amr-ne:protein .
:p amr:name :n .
:n rdf:type amr-ne:name .
:n amr:op1 "Sos-1" .
```
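As a rough illustration of what the "names of proteins" SPARQL query does against this data, the same pattern can be matched in plain Python over a hand-abbreviated subset of the triples above (predicates normalized to `amr:name`/`amr:op1` for the sketch):

```python
# Triples abbreviated from the TTL example above.
triples = {
    (":p2", "rdf:type", "amr-ne:protein"),
    (":p2", "amr:name", ":n2"),
    (":n2", "rdf:type", "amr-ne:name"),
    (":n2", "amr:op1", "Grb2"),
    (":p", "rdf:type", "amr-ne:protein"),
    (":p", "amr:name", ":n"),
    (":n", "rdf:type", "amr-ne:name"),
    (":n", "amr:op1", "Sos-1"),
}

def protein_names(triples):
    """Match the four-triple pattern of the 'names of proteins' query."""
    names = []
    for (p, _, _) in [t for t in triples
                      if t[1] == "rdf:type" and t[2] == "amr-ne:protein"]:
        for (_, _, no) in [t for t in triples
                           if t[0] == p and t[1] == "amr:name"]:
            if (no, "rdf:type", "amr-ne:name") in triples:
                names += [t[2] for t in triples
                          if t[0] == no and t[1] == "amr:op1"]
    return sorted(names)
# protein_names(triples) -> ["Grb2", "Sos-1"]
```

A SPARQL engine performs exactly this kind of pattern matching, but declaratively and over the full graph.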
## `amr_to_rdf.py`
This is a simple script that reads the AMR using libraries from the [AMRICA](https://github.com/nsaphra/AMRICA) toolkit and then transposes the graph structure of the AMR into RDF using [rdflib](https://github.com/RDFLib/rdflib). We use simple heuristics to map namespaces and to generate valid RDF for the AMR as needed.
How to run the script:
```
$ python amr_to_rdf.py -i <input AMR file> -o <output RDF file> [-f <format>]
```
How to test the script:
```
$ python amr_to_rdf.py -i test/bio_ras_0001_1.txt -o out.rdf -f nt
```
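The script's core move is the one-to-one transposition described above: each AMR variable becomes a resource in a per-AMR namespace, its concept becomes an `rdf:type` object (PropBank frames vs. `amr-terms` concepts), and each role becomes a predicate. A much-simplified sketch of that mapping (the real script additionally handles named-entity types, xrefs, and frame-specific ARG roles; the helper below is illustrative only):

```python
# Simplified AMR -> RDF transposition: instance triples type the variables,
# relation triples become predicates between them.
BASE = "http://amr.isi.edu/amr_data/pmid_1177_7939.53#"   # per-AMR namespace
PB = "http://amr.isi.edu/frames/ld/v1.2.2/"               # PropBank frames
TERMS = "http://amr.isi.edu/rdf/amr-terms#"               # other AMR terms

def amr_to_triples(instances, relations):
    triples = []
    for var, concept in instances:
        # PropBank frames end in a sense number (e.g. show-01);
        # all other concepts go to the amr-terms namespace.
        ns = PB if concept.rsplit("-", 1)[-1].isdigit() else TERMS
        triples.append((BASE + var, "rdf:type", ns + concept))
    for src, role, tgt in relations:
        triples.append((BASE + src, TERMS + role, BASE + tgt))
    return triples

t = amr_to_triples([("s", "show-01")], [("s", "ARG1", "h")])
```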
op(\d+)
(\w+)-quantity
([\w\-]+)-entity
accompanier, age
beneficiary
compared-to, concession, condition, consist-of
degree, destination, direction, domain, duration
example, extent
frequency
instrument
location
manner, medium, mod, mode
ord
part, path, polarity, poss, purpose
quant
scale, source, subevent
time, topic, unit
value
wiki
calendar, century, day, dayperiod, decade, era, month, quarter, season, timezone, weekday, year, year2
prep-against, prep-along-with, prep-amid, prep-among, prep-as, prep-at
prep-by
prep-for, prep-from
prep-in, prep-in-addition-to, prep-into
prep-on, prep-on-behalf-of, prep-out-of
prep-to, prep-toward
prep-under
prep-with, prep-without
conj-as-if
location-of
polarity
degree
mode
amr-unknown
interrogative
imperative
expressive
and
or
either
neither
after
near
between
all
no
that
more
too
most
subset, subset-of
wiki
product-of, sum-of
statistical-test
date-entity, date-interval
thing
person, family, animal, language, nationality, ethnic-group, regional-group, religious-group
organization, company, government-organization, military, criminal-organization, political-party
school, university, research-institute
team, league
location, city, city-district, county, state, province, territory, country, local-region, country-region, world-region, continent
ocean, sea, lake, river, gulf, bay, strait, canal
peninsula, mountain, volcano, valley, canyon, island, desert, forest
moon, planet, star, constellation
facility, airport, station, port, tunnel, bridge, road, railway-line, canal
building, theater, museum, palace, hotel, worship-place, sports-facility
market, park, zoo, amusement-park
event, incident, natural-disaster, earthquake, war, conference, game, festival
product, vehicle, ship, aircraft, aircraft-type, spaceship, car-make
work-of-art, picture, music, show, broadcast-program
publication, book, newspaper, magazine, journal
natural-object
law, treaty, award, food-dish, music-key
molecular-physical-entity, small-molecule, protein, protein-family, protein-segment, amino-acid, macro-molecular-complex, enzyme, rna
pathway, gene, dna-sequence, cell, cell-line, organism, disease
"""
A commandline tool for drawing RDF graphs in Graphviz DOT format
You can draw the graph of an RDF file directly:
.. code-block: bash
rdf2dot my_rdf_file.rdf | dot -Tpng | display
"""
import rdflib
import rdflib.extras.cmdlineutils
from rdflib.namespace import Namespace, NamespaceManager
import sys
import cgi
import collections
from rdflib import XSD
LABEL_PROPERTIES = [rdflib.RDFS.label,
rdflib.URIRef("http://purl.org/dc/elements/1.1/title"),
rdflib.URIRef("http://xmlns.com/foaf/0.1/name"),
rdflib.URIRef("http://www.w3.org/2006/vcard/ns#fn"),
rdflib.URIRef("http://www.w3.org/2006/vcard/ns#org")
]
XSDTERMS = [
XSD[x] for x in (
"anyURI", "base64Binary", "boolean", "byte", "date",
"dateTime", "decimal", "double", "duration", "float", "gDay", "gMonth",
"gMonthDay", "gYear", "gYearMonth", "hexBinary", "ID", "IDREF",
"IDREFS", "int", "integer", "language", "long", "Name", "NCName",
"negativeInteger", "NMTOKEN", "NMTOKENS", "nonNegativeInteger",
"nonPositiveInteger", "normalizedString", "positiveInteger", "QName",
"short", "string", "time", "token", "unsignedByte", "unsignedInt",
"unsignedLong", "unsignedShort")]
EDGECOLOR = "blue"
NODECOLOR = "black"
ISACOLOR = "black"
def rdf2dot(g, stream, opts={}):
"""
Convert the RDF graph to DOT
writes the dot output to the stream
"""
fields = collections.defaultdict(set)
nodes = {}
def node(x):
if x not in nodes:
nodes[x] = "node%d" % len(nodes)
return nodes[x]
def label(x, g):
for labelProp in LABEL_PROPERTIES:
l = g.value(x, labelProp)
if l:
return l
try:
return g.namespace_manager.compute_qname(x)[2]
except:
return x
def formatliteral(l, g):
v = cgi.escape(l)
if l.datatype:
return u'&quot;%s&quot;^^%s' % (v, qname(l.datatype, g))
elif l.language:
return u'&quot;%s&quot;@%s' % (v, l.language)
return u'&quot;%s&quot;' % v
def qname(x, g):
try:
q = g.compute_qname(x)
return q[0] + ":" + q[2]
except:
return x
def color(p):
return "BLACK"
stream.write(u"digraph { \n node [ fontname=\"DejaVu Sans\" ] ; \n")
for s, p, o in g:
sn = node(s)
if p == rdflib.RDFS.label:
continue
if isinstance(o, (rdflib.URIRef, rdflib.BNode)):
on = node(o)
opstr = u"\t%s -> %s [ color=%s, label=< <font point-size='14' " + \
u"color='#6666ff'>%s</font> > ] ;\n"
stream.write(opstr % (sn, on, color(p), qname(p, g)))
else:
fields[sn].add((qname(p, g), formatliteral(o, g)))
for u, n in nodes.items():
stream.write(u"# %s %s\n" % (u, n))
f = []
#f = [u"<tr><td align='left' width='40px'>%s</td><td align='left'>%s</td></tr>" %
# x for x in sorted(fields[n])]
nn = g.compute_qname(u)
uu = nn[0] + u":" + nn[2]
opstr = u"%s [ shape=none, color=%s label=< <table color='#666666'" + \
u" cellborder='0' cellspacing='0' border='1'><tr>" + \
u"<td href='%s' bgcolor='#ffffff' colspan='2'>" + \
u"<font point-size='14' color='#000000'>%s</font></td>" + \
u"</tr>%s</table> > ] \n"
stream.write(opstr % (n, NODECOLOR, label(u, g), uu, u"".join(f)))
stream.write("}\n")
def _help():
sys.stderr.write("""
rdf2dot.py [-f <format>] files...
Reads RDF files given on the command line and writes a graph of the RDF
data in DOT language to stdout
-f specifies the parser to use; if not given, it is guessed from the
file extension
""")
def main():
rdflib.extras.cmdlineutils.main(rdf2dot, _help)
if __name__ == '__main__':
main()
#!/usr/bin/env python
"""
amr_to_jsonld.py
Note, this is derived from the source code of AMRICA's disagree_btwn_sents.py script by Naomi Saphra (nsaphra@jhu.edu)
Copyright(c) 2015. All rights reserved.
"""
import argparse
#import argparse_config  # unused, non-stdlib
import codecs
import os
import re
import json
from compare_smatch import amr_metadata
cur_sent_id = 0
def run_main(args):
try:
import rdflib
except ImportError:
raise ImportError('requires rdflib')
infile = codecs.open(args.infile, encoding='utf8')
outfile = open(args.outfile, 'w')
json_obj = []
# namespaces
amr_ns = rdflib.Namespace("http://amr.isi.edu/rdf/core-amr#")
pb_ns = rdflib.Namespace("https://verbs.colorado.edu/propbank#")
ontonotes_ns = rdflib.Namespace("https://catalog.ldc.upenn.edu/LDC2013T19#")
amr_ne_ns = rdflib.Namespace("http://amr.isi.edu/entity-types#")
up_ns = rdflib.Namespace("http://www.uniprot.org/uniprot/")
pfam_ns = rdflib.Namespace("http://pfam.xfam.org/family/")
ns_lookup = {}
nelist = []
nefile = codecs.open("ne.txt", encoding='utf8')
for l in nefile:
for w in re.split(",\s*", l):
nelist.append( w )
for ne in nelist:
ns_lookup[ne] = amr_ne_ns
amrs_same_sent = []
cur_id = ""
while True:
(amr_line, comments) = amr_metadata.get_amr_line(infile)
cur_amr = None
if amr_line:
cur_amr = amr_metadata.AmrMeta.from_parse(amr_line, comments)
if not cur_id:
cur_id = cur_amr.metadata['id']
if cur_amr is None or cur_id != cur_amr.metadata['id']:
amr = amrs_same_sent[0]
(inst, rel1, rel2) = amr.get_triples2()
# lookup from original amr objects and simple python objects
lookup = {}
context = {}
default = "http://amr.isi.edu/amr_data/" + amr.metadata['id'] + "#"
temp_ns = rdflib.Namespace(default)
a1 = {}
a1["@type"] = amr_ns.AMR.toPython()
json_obj.append(a1)
#:a1 amr:has-sentence "Sos-1 has been shown to be part of a signaling complex with Grb2, which mediates the activation of Ras upon RTK stimulation." .
a1['has-sentence'] = amr.metadata['snt']
#:a1 amr:has-id "pmid_1177_7939.53"
a1['@id'] = amr.metadata['id']
#:a1 amr:has-date "2015-03-07T10:57:15
a1['has-date'] = amr.metadata['date']
#:a1 amr:has-annotator SDL-AMR-09
#:a1 amr:is-preferred "true"^^xsd:boolean
#:a1 amr:has-file "pmid_1177_7939_53.txt"
amr_root = {}
lookup[amr.root] = amr_root
a1['root'] = amr_root
context['root'] = amr_ns.root.toPython()
context['@base'] = default
for (p, s, o) in inst:
if( ns_lookup.get(o,None) is not None ):
context[o] = amr_ne_ns[o].toPython()
elif( re.search('\-\d+$', o) is not None ):
context[o] = pb_ns[o].toPython()
else:
context[o] = amr_ns[o].toPython()
if( lookup.get(s,None) is None ):
lookup[s] = {}
s_obj = lookup[s]
s_obj["@id"] = s
s_obj["@type"] = o
for (p, s, o) in rel2:
if( lookup.get(s,None) is None ):
lookup[s] = {}
if( lookup.get(o,None) is None ):
lookup[o] = {}
s_obj = lookup[s]
o_obj = lookup[o]
if( s != o ):
s_obj[p] = o_obj
for (p, s, l) in rel1:
if( lookup.get(s,None) is None ):
lookup[s] = {}
s_obj = lookup[s]
s_obj[p] = l
a1['@context'] = context
amrs_same_sent = []
if cur_amr is not None:
cur_id = cur_amr.metadata['id']
else:
break
amrs_same_sent.append(cur_amr)
json.dump( json_obj, outfile, indent=2 )
outfile.close()
infile.close()
# gold_aligned_fh and gold_aligned_fh.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--infile', help='amr input file')
parser.add_argument('-o', '--outfile', help='RDF output file')
args = parser.parse_args()
run_main(args)
#!/usr/bin/env python
"""
amr_to_rdf.py
Note, this is derived from the source code of AMRICA's disagree_btwn_sents.py script by Naomi Saphra (nsaphra@jhu.edu)
Copyright(c) 2015. All rights reserved.
"""
import argparse
#import argparse_config
import codecs
import os
import re
import textwrap
from compare_smatch import amr_metadata
from rdflib.namespace import RDF
from rdflib.namespace import RDFS
from rdflib.plugins import sparql
cur_sent_id = 0
def strip_word_alignments(str, patt):
match = patt.match(str)
if match:
str = match.group(1)
return str
def run_main(args):
inPath = args.inPath
outPath = args.outPath
#
# If the path is a directory then loop over the directory contents,
# Else run the script on the file as described
#
if( os.path.isfile(inPath) ):
run_main_on_file(args)
else:
if not os.path.exists(outPath):
os.makedirs(outPath)
for fn in os.listdir(inPath):
if os.path.isfile(inPath+"/"+fn) and fn.endswith(".txt"):
args.inPath =inPath + "/" + fn
args.outPath = outPath + "/" + fn + ".rdf"
run_main_on_file(args)
def run_main_on_file(args):
try:
import rdflib
except ImportError:
raise ImportError('requires rdflib')
infile = codecs.open(args.inPath, encoding='utf8')
outfile = open(args.outPath, 'w')
pBankRoles = True
if( not(args.pbankRoles == u'1') ):
pBankRoles = False
xref_namespace_lookup = {}
with open('xref_namespaces.txt') as f:
xref_lines = f.readlines()
for l in xref_lines:
line = re.split("\t", l)
xref_namespace_lookup[line[0]] = line[1].rstrip('\r\n')
# create the basic RDF data structure
g = rdflib.Graph()
# namespaces
amr_ns = rdflib.Namespace("http://amr.isi.edu/rdf/core-amr#")
amr_terms_ns = rdflib.Namespace("http://amr.isi.edu/rdf/amr-terms#")
amr_data = rdflib.Namespace("http://amr.isi.edu/amr_data#")
pb_ns = rdflib.Namespace("http://amr.isi.edu/frames/ld/v1.2.2/")
amr_ne_ns = rdflib.Namespace("http://amr.isi.edu/entity-types#")
up_ns = rdflib.Namespace("http://www.uniprot.org/uniprot/")
pfam_ns = rdflib.Namespace("http://pfam.xfam.org/family/")
ontonotes_ns = rdflib.Namespace("https://catalog.ldc.upenn.edu/LDC2013T19#")
g.namespace_manager.bind('propbank', pb_ns, replace=True)
g.namespace_manager.bind('amr-core', amr_ns, replace=True)
g.namespace_manager.bind('amr-terms', amr_terms_ns, replace=True)
g.namespace_manager.bind('entity-types', amr_ne_ns, replace=True)
g.namespace_manager.bind('amr-data', amr_data, replace=True)
for k in xref_namespace_lookup.keys():
temp_ns = rdflib.Namespace(xref_namespace_lookup[k])
g.namespace_manager.bind(k, temp_ns, replace=True)
xref_namespace_lookup[k] = temp_ns
# Basic AMR Ontology consisting of
# 1. concepts
# 2. roles
# 3. strings (which are actually going to be Literal(string)s
conceptClass = amr_ns.Concept
neClass = amr_ns.NamedEntity
frameClass = amr_ns.Frame
roleClass = amr_ns.Role
frameRoleClass = pb_ns.FrameRole
g.add( (conceptClass, rdflib.RDF.type, rdflib.RDFS.Class) )
g.add( (conceptClass, RDFS.label, rdflib.Literal("AMR-Concept") ) )
#g.add( (conceptClass, RDFS.comment, rdflib.Literal("Class of all concepts expressed in AMRs") ) )
g.add( (neClass, rdflib.RDF.type, conceptClass) )
g.add( (neClass, RDFS.label, rdflib.Literal("AMR-EntityType") ) )
#g.add( (neClass, RDFS.comment, rdflib.Literal("Class of all named entities expressed in AMRs") ) )
g.add( (neClass, rdflib.RDF.type, conceptClass) )
g.add( (neClass, RDFS.label, rdflib.Literal("AMR-Term") ) )
#g.add( (neClass, RDFS.comment, rdflib.Literal("Class of all named entities expressed in AMRs") ) )
g.add( (roleClass, rdflib.RDF.type, rdflib.RDFS.Class) )
g.add( (roleClass, RDFS.label, rdflib.Literal("AMR-Role") ) )
#g.add( (roleClass, RDFS.comment, rdflib.Literal("Class of all roles expressed in AMRs") ) )
g.add( (frameRoleClass, rdflib.RDF.type, roleClass) )
g.add( (frameRoleClass, RDFS.label, rdflib.Literal("AMR-PropBank-Role") ) )
#g.add( (frameRoleClass, RDFS.comment, rdflib.Literal("Class of all roles of PropBank frames") ) )
g.add( (frameClass, rdflib.RDF.type, conceptClass) )
g.add( (frameClass, RDFS.label, rdflib.Literal("AMR-PropBank-Frame") ) )
#g.add( (frameClass, RDFS.comment, rdflib.Literal("Class of all frames expressed in AMRs") ) )
amr_count = 0
ns_lookup = {}
class_lookup = {}
nelist = []
corelist = []
pattlist = []
pmid_patt = re.compile('.*pmid_(\d+)_(\d+).*')
word_align_patt = re.compile('(.*)\~e\.(.+)')
propbank_patt = re.compile('^(.*)\-\d+$')
opN_patt = re.compile('op(\d+)')
arg_patt = re.compile('ARG\d+')
with open('amr-ne.txt') as f:
ne_lines = f.readlines()
for l in ne_lines:
for w in re.split(",\s*", l):
w = w.rstrip('\r\n')
nelist.append( w )
for ne in nelist:
ns_lookup[ne] = amr_ne_ns
class_lookup[ne] = neClass
with open('amr-core.txt') as f:
core_lines = f.readlines()
for l in core_lines:
for w in re.split(",\s*", l):
w = w.rstrip('\r\n')
corelist.append( w )
for c in corelist:
ns_lookup[c] = amr_ns
class_lookup[c] = conceptClass
pattfile = codecs.open("amr-core-patterns.txt", encoding='utf8')
for l in pattfile:
pattlist.append( l.rstrip('\r\n') )
amrs_same_sent = []
cur_id = ""
while True:
(amr_line, comments) = amr_metadata.get_amr_line(infile)
cur_amr = None
vb_lookup = {}
label_lookup_table = {}
xref_variables = {}
if amr_line:
cur_amr = amr_metadata.AmrMeta.from_parse(amr_line, comments)
if not cur_id:
cur_id = cur_amr.metadata['id']
if cur_amr is None or cur_id != cur_amr.metadata['id']:
amr = amrs_same_sent[0]
(inst, rel1, rel2) = amr.get_triples2()
temp_ns = rdflib.Namespace("http://amr.isi.edu/amr_data/" + amr.metadata['id'] + "#")
a1 = temp_ns.root01 # reserve term root01
# :a1 rdf:type amr:AMR .
g.add( (a1,
rdflib.RDF.type,
amr_ns.AMR) )
#:a1 amr:has-id "pmid_1177_7939.53"
amr_id = amr.metadata['id']
g.add( (a1,
amr_ns['has-id'],
rdflib.Literal(amr_id)))
match = pmid_patt.match(amr_id)
if match:
pmid = match.group(1) + match.group(2)
g.add( (a1,
amr_ns['has-pmid'],
rdflib.Literal(pmid)))
#:a1 amr:has-sentence "Sos-1 has been shown to be part of a signaling complex with Grb2, which mediates the activation of Ras upon RTK stimulation." .
if( amr.metadata.get('snt', None) is not None):
g.add( (a1,
amr_ns['has-sentence'],
rdflib.Literal(amr.metadata['snt']) )
)
#:a1 amr:has-date "2015-03-07T10:57:15
if( amr.metadata.get('date', None) is not None):
g.add( (a1,
amr_ns['has-date'],
rdflib.Literal(amr.metadata['date'])))
#:a1 amr:amr-annotator SDL-AMR-09
if( amr.metadata.get('amr-annotator', None) is not None):
g.add( (a1,
amr_ns['has-annotator'],
rdflib.Literal(amr.metadata['amr-annotator'])))
#:a1 amr:tok
if( amr.metadata.get('tok', None) is not None):
g.add( (a1,
amr_ns['has-tokens'],
rdflib.Literal(amr.metadata['tok'])))
#:a1 amr:alignments
if( amr.metadata.get('alignments', None) is not None):
g.add( (a1,
amr_ns['has-alignments'],
rdflib.Literal(amr.metadata['alignments'])))
g.add( (a1, amr_ns.root, temp_ns[amr.root]) )
# Add triples for setting types pointing to other resources
frames = {}
for (p, s, o) in inst:
o = strip_word_alignments(o,word_align_patt)
#if word_pos is not None:
# g.add( (temp_ns[s],
# amr_ns['has-word-pos'],
# rdflib.Literal(word_pos)) )
if( ns_lookup.get(o,None) is not None ):
resolved_ns = ns_lookup.get(o,None)
o_resolved = resolved_ns[o]
if( class_lookup.get(o,None) is not None):
g.add( (o_resolved, rdflib.RDF.type, class_lookup.get(o,None)) )
else:
raise ValueError(o_resolved + ' does not have a class assigned.')
elif( re.search('\-\d+$', o) is not None ):
#match = propbank_patt.match(o)
#str = ""
#if match:
# str = match.group(1)
#o_resolved = pb_ns[str + ".html#" +o ]
o_resolved = pb_ns[ o ]
g.add( (o_resolved, rdflib.RDF.type, frameClass) )
elif( o == 'xref' and args.fixXref):
continue
elif( not(o == 'name') ): # ignore 'name' objects but add all others.
o_resolved = amr_terms_ns[o]
g.add( (o_resolved, rdflib.RDF.type, conceptClass) )
# identify xref variables in AMR, don't retain it as a part of the graph.
else:
continue
frames[s] = o
g.add( (temp_ns[s], RDF.type, o_resolved) )
# Add object properties for local links in the current AMR
for (p, s, o) in rel2:
if( p == "TOP" ):
continue
# Do not include word positions for predicates
# (since they are more general and do not need to linked to everything).
p = strip_word_alignments(p,word_align_patt)
o = strip_word_alignments(o,word_align_patt)
# remember which objects have name objects
if( p == 'name' ):
label_lookup_table[o] = s
# objects with value objects should also be in
elif( p == 'xref' and args.fixXref):
xref_variables[o] = s
elif( re.search('^ARG\d+$', p) is not None ):
frameRole = frames[s] + "." + p
if( not(pBankRoles) ):
frameRole = p
g.add( (pb_ns[frameRole], rdflib.RDF.type, frameRoleClass) )
g.add( (temp_ns[s], pb_ns[frameRole], temp_ns[o] ) )
vb_lookup[s] = temp_ns[s]
vb_lookup[frameRole] = pb_ns[frameRole]
vb_lookup[o] = temp_ns[o]
elif( re.search('^ARG\d+\-of$', p) is not None ):
frameRole = frames[o] + "." + p
if( not(pBankRoles) ):
frameRole = p
g.add( (pb_ns[frameRole], rdflib.RDF.type, frameRoleClass) )
g.add( (temp_ns[s], pb_ns[frameRole], temp_ns[o] ) )
vb_lookup[s] = temp_ns[s]
vb_lookup[frameRole] = pb_ns[frameRole]
vb_lookup[o] = temp_ns[o]
else:
g.add( (amr_terms_ns[p], rdflib.RDF.type, roleClass) )
g.add( (temp_ns[s], amr_terms_ns[p], temp_ns[o]) )
vb_lookup[s] = temp_ns[s]
vb_lookup[p] = amr_terms_ns[p]
vb_lookup[o] = temp_ns[o]
# Add data properties in the current AMR
labels = {}
for (p, s, l) in rel1:
p = strip_word_alignments(p, word_align_patt)
l = strip_word_alignments(l, word_align_patt)
#
# Build labels across multiple 'op1, op2, ... opN' links,
#
opN_match = re.match(opN_patt, p)
if( opN_match is not None and
label_lookup_table.get(s,None) is not None):
opN = int(opN_match.group(1))
ss = label_lookup_table[s]
if( labels.get(ss, None) is None ):
labels[ss] = []
labels[ss].append( (opN, l) )
elif( xref_variables.get(s,None) is not None
and p == 'value'
and args.fixXref):
for k in xref_namespace_lookup.keys():
if( l.startswith(k) ):
l2 = l[len(k):]
xref_vb = xref_variables.get(s,None)
resolved_xref_vb = vb_lookup.get(xref_vb,None)
g.add( (resolved_xref_vb,
amr_ns['xref'],
xref_namespace_lookup[k][l2]) )
# Special treatment for propbank roles.
elif( re.search('ARG\d+$', p) is not None ):
frameRole = frames[s] + "." + p
if( not(pBankRoles) ):
frameRole = p
g.add( (pb_ns[frameRole], rdflib.RDF.type, frameRoleClass) )
g.add( (temp_ns[s], pb_ns[frameRole], rdflib.Literal(l) ) )
# Otherwise, it's just a literal
else:
g.add( (temp_ns[s], amr_terms_ns[p], rdflib.Literal(l) ) )
# Add labels here
# ["\n".join([i.split(' ')[j] for j in range(5)]) for i in g.vs["id"]]
for key in labels.keys():
labelArray = [i[1] for i in sorted(labels[key])];
label = " ".join( labelArray )
g.add( (temp_ns[key],
RDFS.label,
rdflib.Literal(label) ) )
amrs_same_sent = []
if cur_amr is not None:
cur_id = cur_amr.metadata['id']
else:
break
amrs_same_sent.append(cur_amr)
amr_count = amr_count+1
# Additional processing to clean up.
# 1. Add labels to AMR objects
#q = sparql.prepareQuery("select distinct ?s ?label " +
# "where { " +
# "?s <http://amr.isi.edu/rdf/core-amr#name> ?n . " +
# "?n <http://amr.isi.edu/rdf/core-amr#op1> ?label " +
# "}")
#qres = g.query(q)
#for row in qres:
# print("%s type %s" % row)
print ("%d AMRs converted" % amr_count)
outfile.write( g.serialize(format=args.format) )
outfile.close()
infile.close()
# gold_aligned_fh and gold_aligned_fh.close()
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument('-i', '--inPath', help='AMR input file or directory')
parser.add_argument('-o', '--outPath', help='RDF output file or directory')
parser.add_argument('-pbr', '--pbankRoles', default='1', help='Do we include PropBank Roles?')
parser.add_argument('-kx', '--fixXref', default='1', help='Keep existing Xref formalism?')
parser.add_argument('-v', '--verbose', action='store_true')
parser.add_argument('-f', '--format', nargs='?', default='nt',
help="RDF Format: xml, n3, nt, trix, rdfa")
args = parser.parse_args()
run_main(args)
"""
amr_alignment.py
Author: Naomi Saphra (nsaphra@jhu.edu)
Copyright(c) 2014
Builds a weighted mapping of tokens between parallel sentences for use in
weighted cross-language Smatch alignment.
Takes in an output file from GIZA++ (specified in construction functions).
"""
from collections import defaultdict
from pynlpl.formats.giza import GizaSentenceAlignment
import re
class Amr2AmrAligner(object):
def __init__(self, num_best=5, num_best_in_file=-1, src2tgt_fh=None, tgt2src_fh=None):
if src2tgt_fh is None or tgt2src_fh is None:
self.is_default = True
self.node_weight_fn = self.dflt_node_weight_fn
self.edge_weight_fn = self.dflt_edge_weight_fn
else:
self.is_default = False
self.node_weight_fn = None
self.edge_weight_fn = self.xlang_edge_weight_fn
self.src2tgt_fh = src2tgt_fh
self.tgt2src_fh = tgt2src_fh
self.amr2amr = {}
self.num_best = num_best
self.num_best_in_file = num_best_in_file
self.last_nbest_line = {self.src2tgt_fh:None, self.tgt2src_fh:None}
if num_best_in_file < 0:
self.num_best_in_file = num_best
assert self.num_best_in_file >= self.num_best
def set_amrs(self, tgt_amr, src_amr):
if self.is_default:
return
self.tgt_toks = tgt_amr.metadata['tok'].strip().split()
self.src_toks = src_amr.metadata['tok'].strip().split()
sent2sent_union = align_sent2sent_union(self.tgt_toks, self.src_toks,
self.get_nbest_alignments(self.src2tgt_fh), self.get_nbest_alignments(self.tgt2src_fh))
if 'alignments' in tgt_amr.metadata:
amr2sent_tgt = align_amr2sent_jamr(tgt_amr, self.tgt_toks, tgt_amr.metadata['alignments'].strip().split())
else:
amr2sent_tgt = align_amr2sent_dflt(tgt_amr, self.tgt_toks)
if 'alignments' in src_amr.metadata:
amr2sent_src = align_amr2sent_jamr(src_amr, self.src_toks, src_amr.metadata['alignments'].strip().split())
else:
amr2sent_src = align_amr2sent_dflt(src_amr, self.src_toks)
self.amr2amr = defaultdict(float)
for (tgt_lbl, tgt_scores) in amr2sent_tgt.items():
for (src_lbl, src_scores) in amr2sent_src.items():
if src_lbl.lower() == tgt_lbl.lower():
self.amr2amr[(tgt_lbl, src_lbl)] += 1.0
continue
for (t, t_score) in enumerate(tgt_scores):
for (s, s_score) in enumerate(src_scores):
score = t_score * s_score * sent2sent_union[t][s]
if score > 0:
self.amr2amr[(tgt_lbl, src_lbl)] += score
self.node_weight_fn = lambda t,s : self.amr2amr[(t, s)]
def const_map_fn(self, const):
""" Get all const strings from source amr that could map to target const """
const_matches = [const]
    # iterate explicitly: Python 2's tuple-unpacking lambda syntax is invalid in Python 3
    for (t, s) in self.amr2amr:
      if t == const and self.node_weight_fn(t, s) > 0:  # weight > 0
        const_matches.append(s)
return sorted(const_matches, key=lambda x: self.node_weight_fn(const, x), reverse=True)
@staticmethod
def dflt_node_weight_fn(tgt_label, src_label):
return 1.0 if tgt_label.lower() == src_label.lower() else 0.0
@staticmethod
def dflt_edge_weight_fn(tgt_label, src_label):
return 1.0 if tgt_label.lower() == src_label.lower() else 0.0
def xlang_edge_weight_fn(self, tgt_label, src_label):
tgt = tgt_label.lower()
src = src_label.lower()
if tgt == src:
# operand edges are all equivalent
#TODO make this an RE instead?
return 1.0
if tgt.startswith("op") and src.startswith("op"):
return 0.9 # frumious hack to favor similar op edges
return 0.0
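The three weight functions above share a simple label-matching scheme: exact case-insensitive equality scores 1.0, and, cross-lingually, any two `op` edges score 0.9. A minimal standalone sketch of that behavior:

```python
# Standalone sketch of the edge-weighting scheme above: exact
# case-insensitive matches score 1.0, two "op" edges score 0.9
# (the "frumious hack"), anything else 0.0.
def edge_weight(tgt_label, src_label):
    tgt, src = tgt_label.lower(), src_label.lower()
    if tgt == src:
        return 1.0
    if tgt.startswith("op") and src.startswith("op"):
        return 0.9
    return 0.0

assert edge_weight("ARG0", "arg0") == 1.0
assert edge_weight("op1", "op2") == 0.9
assert edge_weight("ARG0", "ARG1") == 0.0
```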
def get_nbest_alignments(self, fh):
""" Read an entry from the giza alignment .A3 NBEST file. """
aligns = []
curr_sent = -1
start_ind = 0
if self.last_nbest_line[fh]:
if self.num_best > 0:
aligns.append(self.last_nbest_line[fh])
start_ind = 1
curr_sent = self.last_nbest_line[fh][0].index
self.last_nbest_line[fh] = None
for ind in range(start_ind, self.num_best_in_file):
meta_line = fh.readline()
if meta_line == "":
if len(aligns) == 0:
return None
else:
break
      meta = re.match(r"# Sentence pair \((\d+)\) "
                      r"source length (\d+) target length (\d+) "
                      r"alignment score : (.+)", meta_line)
      if not meta:
        raise Exception("Malformed metadata line in n-best alignment file: %s" % meta_line)
sent = int(meta.group(1))
if curr_sent < 0:
curr_sent = sent
score = float(meta.group(4))
tgt_line = fh.readline()
src_line = fh.readline()
if sent != curr_sent:
self.last_nbest_line[fh] = (GizaSentenceAlignment(src_line, tgt_line, sent), score)
break
if ind < self.num_best:
aligns.append((GizaSentenceAlignment(src_line, tgt_line, sent), score))
return aligns
default_aligner = Amr2AmrAligner()
def get_all_labels(amr):
ret = [v for v in amr.var_values]
for l in amr.const_links:
ret += [v for (k,v) in l.items()]
return ret
def align_amr2sent_dflt(amr, sent):
labels = get_all_labels(amr)
align = {l:[0.0 for tok in sent] for l in labels}
for label in labels:
lbl = label.lower()
# checking for multiwords / bad segmentation
# ('_' replaces ' ' in multiword quotes)
# TODO just fix AMR format parser to deal with spaces in quotes
possible_toks = lbl.split('_')
possible_toks.append(lbl)
matches = [t_ind for (t_ind, t) in enumerate(sent) if t.lower() in possible_toks]
for t_ind in matches:
align[label][t_ind] = 1.0 / len(matches)
return align
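For illustration, the default alignment distributes each label's unit weight uniformly over matching tokens, splitting on `'_'` to catch multiword constants. A reduced sketch of the per-label step (not the full function above, which builds the whole label table):

```python
# Reduced sketch of the per-label logic in align_amr2sent_dflt:
# uniform weight over tokens matching the label or any '_'-separated part.
def align_label(label, sent):
    lbl = label.lower()
    possible_toks = lbl.split('_') + [lbl]
    matches = [i for i, t in enumerate(sent) if t.lower() in possible_toks]
    weights = [0.0] * len(sent)
    for i in matches:
        weights[i] = 1.0 / len(matches)
    return weights

print(align_label("New_York", ["visit", "New", "York"]))  # [0.0, 0.5, 0.5]
```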
def parse_jamr_alignment(chunk):
(tok_range, nodes_str) = chunk.split('|')
(start_tok, end_tok) = tok_range.split('-')
node_list = nodes_str.split('+')
return (int(start_tok), int(end_tok), node_list)
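A JAMR alignment chunk has the form `start-end|path1+path2+...`, with an end-exclusive token range. A worked example on a hypothetical chunk:

```python
# Parsing the JAMR chunk "3-5|0.0+0.1": tokens 3..4 align to the AMR
# nodes at tree paths 0.0 and 0.1 (same split logic as parse_jamr_alignment).
chunk = "3-5|0.0+0.1"
tok_range, nodes_str = chunk.split('|')
start_tok, end_tok = (int(x) for x in tok_range.split('-'))
node_list = nodes_str.split('+')
assert (start_tok, end_tok, node_list) == (3, 5, ['0.0', '0.1'])
```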
def align_label2toks_en(label, sent, weights, toks_to_align):
"""
label: node label to map
sent: token list to map label to
weights: list to be modified with new weights
default_full: set True to have the default distribution sum to 1 instead of 0
return list mapping token index to match weight
"""
# TODO frumious hack. should set up actual stemmer sometime.
lbl = label.lower()
stem = lbl
  wordnet = re.match(r"(.+)-\d\d", lbl)
if wordnet:
stem = wordnet.group(1)
if len(stem) > 4: # arbitrary
if len(stem) > 5:
stem = stem[:-2]
else:
stem = stem[:-1]
def is_match(tok):
return tok == lbl or \
(len(tok) >= len(stem) and tok[:len(stem)] == stem)
matches = [t_ind for t_ind in toks_to_align if is_match(sent[t_ind].lower())]
if len(matches) == 0:
matches = toks_to_align
for t_ind in matches:
weights[t_ind] += 1.0 / len(matches)
return weights
def align_amr2sent_jamr(amr, sent, jamr_line):
"""
amr: an amr to map nodes to sentence toks
sent: sentence array of toks
jamr_line: metadata field 'alignments', aligned with jamr
return dict mapping amr node labels to match weights for each tok in sent
"""
labels = get_all_labels(amr)
labels_remain = {label:labels.count(label) for label in labels}
tokens_remain = set(range(len(sent)))
align = {l:[0.0 for tok in sent] for l in labels}
for chunk in jamr_line:
(start_tok, end_tok, node_list) = parse_jamr_alignment(chunk)
for node_path in node_list:
label = amr.path2label[node_path]
toks_to_align = range(start_tok, end_tok)
align[label] = align_label2toks_en(label, sent, align[label], toks_to_align)
labels_remain[label] -= 1
for t in toks_to_align:
tokens_remain.discard(t)
#TODO should really switch from a label-token-label alignment model to node-token-node
for label in labels_remain:
if labels_remain[label] > 0:
align[label] = align_label2toks_en(label, sent, align[label], tokens_remain)
for label in align:
z = sum(align[label])
if z == 0:
continue
align[label] = [w/z for w in align[label]]
return align
def align_sent2sent(tgt_toks, src_toks, alignment_scores):
z = sum([s for (a,s) in alignment_scores])
tok_align = [[0.0 for s in src_toks] for t in tgt_toks]
for (align, score) in alignment_scores:
for srcind, tgtind in align.alignment:
if tgtind >= 0 and srcind >= 0:
tok_align[tgtind][srcind] += score
  for tgtind in range(len(tgt_toks)):
    for srcind in range(len(src_toks)):
      tok_align[tgtind][srcind] /= z
return tok_align
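The matrix built here is a score-weighted average over the n-best alignments: each alignment contributes its score to every (tgt, src) pair it links, then the whole matrix is normalized by the total score. A sketch with plain `(srcind, tgtind)` pair lists standing in for `GizaSentenceAlignment` objects:

```python
# Sketch of the score-weighted token matrix built by align_sent2sent,
# with alignments as plain (srcind, tgtind) pair lists (mock stand-ins
# for GizaSentenceAlignment).
def sent2sent(n_tgt, n_src, alignment_scores):
    z = sum(score for (_, score) in alignment_scores)
    m = [[0.0] * n_src for _ in range(n_tgt)]
    for pairs, score in alignment_scores:
        for srcind, tgtind in pairs:
            if tgtind >= 0 and srcind >= 0:
                m[tgtind][srcind] += score
    return [[w / z for w in row] for row in m]

m = sent2sent(2, 2, [([(0, 0), (1, 1)], 3.0), ([(0, 1), (1, 0)], 1.0)])
assert m[0][0] == 0.75 and m[0][1] == 0.25  # 3/4 and 1/4 of total score
```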
def align_sent2sent_union(tgt_toks, src_toks, src2tgt, tgt2src):
src2tgt_align = align_sent2sent(tgt_toks, src_toks, src2tgt)
tgt2src_align = align_sent2sent(src_toks, tgt_toks, tgt2src)
tok_align = [[0.0 for s in src_toks] for t in tgt_toks]
for tgtind, tgttok in enumerate(tgt_toks):
for srcind, srctok in enumerate(src_toks):
tok_align[tgtind][srcind] = \
(src2tgt_align[tgtind][srcind] + tgt2src_align[srcind][tgtind]) / 2.0
return tok_align
#!/usr/bin/env python
"""
amr_metadata.py
Author: Naomi Saphra (nsaphra@jhu.edu)
Copyright(c) 2014
Read in an AMR file while also processing the metadata in its comments
"""
import re
from smatch.amr import AMR
class AmrMeta(AMR):
  def __init__(self, var_list=None, var_value_list=None,
               link_list=None, const_link_list=None, path2label=None,
               base_amr=None, metadata=None):
    if base_amr is None:
      super(AmrMeta, self).__init__(var_list, var_value_list,
                                    link_list, const_link_list, path2label)
    else:
      self.nodes = base_amr.nodes
      self.root = base_amr.root
      self.var_values = base_amr.var_values
      self.links = base_amr.links
      self.const_links = base_amr.const_links
      self.path2label = base_amr.path2label
    # default to a fresh dict: a mutable {} default would be shared across instances
    self.metadata = metadata if metadata is not None else {}
@classmethod
def from_parse(cls, annotation_line, comment_lines, xlang=False):
metadata = {}
for l in comment_lines:
matches = re.findall(r'::(\S+)\s(([^:]|:(?!:))+)', l)
for m in matches:
metadata[m[0]] = m[1].strip()
base_amr = AMR.parse_AMR_line(annotation_line, xlang=xlang)
return cls(base_amr=base_amr, metadata=metadata)
def get_amr_line(infile):
""" Read an entry from the input file. AMRs are separated by blank lines. """
cur_comments = []
cur_amr = []
has_content = False
for line in infile:
if line[0] == "(" and len(cur_amr) != 0:
cur_amr = []
if line.strip() == "":
if not has_content:
continue
else:
break
elif line.strip().startswith("#"):
cur_comments.append(line.strip())
else:
has_content = True
cur_amr.append(line.strip())
return ("".join(cur_amr), cur_comments)
#!/usr/bin/env python
"""
smatch_graph.py
Author: Naomi Saphra (nsaphra@jhu.edu)
Copyright(c) 2014
Describes a class for building graphs of AMRs with disagreements highlighted.
"""
import copy
import networkx as nx
import pygraphviz as pgz
from pynlpl.formats.giza import GizaSentenceAlignment
from amr_alignment import Amr2AmrAligner
from amr_alignment import default_aligner
import amr_metadata
from smatch import smatch
GOLD_COLOR = 'blue'
TEST_COLOR = 'red'
DFLT_COLOR = 'black'
class SmatchGraph:
def __init__(self, inst, rel1, rel2, \
gold_inst_t, gold_rel1_t, gold_rel2_t, \
match, const_map_fn=default_aligner.const_map_fn):
"""
TODO correct these params
Input:
(inst, rel1, rel2) from test amr.get_triples2()
(gold_inst_t, gold_rel1_t, gold_rel2_t) from gold amr2dict()
match from smatch
const_map_fn returns a sorted list of gold label matches for a test label
"""
(self.inst, self.rel1, self.rel2) = (inst, rel1, rel2)
(self.gold_inst_t, self.gold_rel1_t, self.gold_rel2_t) = \
(gold_inst_t, gold_rel1_t, gold_rel2_t)
self.match = match # test var index -> gold var index
self.map_fn = const_map_fn
(self.unmatched_inst, self.unmatched_rel1, self.unmatched_rel2) = \
[copy.deepcopy(x) for x in (self.gold_inst_t, self.gold_rel1_t, self.gold_rel2_t)]
self.gold_ind = {} # test variable hash -> gold variable index
self.G = nx.MultiDiGraph()
def smatch2graph(self, node_weight_fn=None, edge_weight_fn=None):
"""
Returns graph of test AMR / gold AMR union, with hilighted disagreements for
different labels on edges and nodes, unmatched nodes and edges.
"""
for (ind, (i, v, instof)) in enumerate(self.inst):
self.add_inst(ind, v, instof)
for (reln, v, const) in self.rel1:
self.add_rel1(reln, v, const)
for (reln, v1, v2) in self.rel2:
self.add_rel2(reln, v1, v2)
if node_weight_fn and edge_weight_fn:
self.unmatch_dead_nodes(node_weight_fn, edge_weight_fn)
# Add gold standard elements not in test
test_ind = {v:k for (k,v) in self.gold_ind.items()} # reverse lookup from gold ind
for (ind, instof) in self.unmatched_inst.items():
test_ind[ind] = u'GOLD %s' % ind
self.add_node(test_ind[ind], '', instof, test_ind=-1, gold_ind=ind)
for ((ind, const), relns) in self.unmatched_rel1.items():
for reln in relns:
const_hash = test_ind[ind] + ' ' + const
if const_hash not in test_ind:
test_ind[const_hash] = const_hash
self.add_node(const_hash, '', const)
self.add_edge(test_ind[ind], test_ind[const_hash], '', reln)
for ((ind1, ind2), relns) in self.unmatched_rel2.items():
for reln in relns:
self.add_edge(test_ind[ind1], test_ind[ind2], '', reln)
return self.G
def get_text_alignments(self):
""" Return an array of variable ID mappings, including labels, that are human-readable.
Call only after smatch2graph(). """
align = []
for (v, attr) in self.G.nodes(data=True):
if attr['test_ind'] < 0 and attr['gold_ind'] < 0:
continue
align.append("%s\t%s\t-\t%s\t%s" % (attr['test_ind'], attr['test_label'], attr['gold_ind'], attr['gold_label']))
return align
def add_edge(self, v1, v2, test_lbl, gold_lbl):
assert(gold_lbl == '' or test_lbl == '' or gold_lbl == test_lbl)
if gold_lbl == '':
self.G.add_edge(v1, v2, label=test_lbl, test_label=test_lbl, gold_label=gold_lbl, color=TEST_COLOR)
elif test_lbl == '':
self.G.add_edge(v1, v2, label=gold_lbl, test_label=test_lbl, gold_label=gold_lbl, color=GOLD_COLOR)
elif test_lbl == gold_lbl:
self.G.add_edge(v1, v2, label=test_lbl, test_label=test_lbl, gold_label=gold_lbl, color=DFLT_COLOR)
def add_node(self, v, test_lbl, gold_lbl, test_ind=-1, gold_ind=-1):
assert(gold_lbl or test_lbl)
if gold_lbl == '':
self.G.add_node(v, label=u'%s / *' % test_lbl, test_label=test_lbl, gold_label=gold_lbl, \
test_ind=test_ind, gold_ind=gold_ind, color=TEST_COLOR)
elif test_lbl == '':
self.G.add_node(v, label=u'* / %s' % gold_lbl, test_label=test_lbl, gold_label=gold_lbl, \
test_ind=test_ind, gold_ind=gold_ind, color=GOLD_COLOR)
elif test_lbl == gold_lbl:
self.G.add_node(v, label=test_lbl, test_label=test_lbl, gold_label=gold_lbl, \
test_ind=test_ind, gold_ind=gold_ind, color=DFLT_COLOR)
else:
self.G.add_node(v, label=u'%s / %s' % (test_lbl, gold_lbl), test_label=test_lbl, gold_label=gold_lbl, \
test_ind=test_ind, gold_ind=gold_ind, color=DFLT_COLOR)
def add_inst(self, ind, var, instof):
self.gold_ind[var] = self.match[ind]
gold_lbl = ''
gold_ind = self.match[ind]
if gold_ind >= 0: # there's a gold match
gold_lbl = self.gold_inst_t[gold_ind]
if self.match[ind] in self.unmatched_inst:
del self.unmatched_inst[gold_ind]
self.add_node(var, instof, gold_lbl, test_ind=ind, gold_ind=gold_ind)
def add_rel1(self, reln, var, const):
const_matches = self.map_fn(const)
gold_edge_lbl = ''
# we match const to the highest-ranked match label from the var
gold_node_lbl = ''
node_hash = var+' '+const
for const_match in const_matches:
if (self.gold_ind[var], const_match) in self.gold_rel1_t:
gold_node_lbl = const_match
#TODO put the metatable editing in the helper fcns?
if reln not in self.gold_rel1_t[(self.gold_ind[var], const_match)]:
# relns between existing nodes should be in unmatched rel2
self.gold_ind[node_hash] = const_match
self.unmatched_rel2[(self.gold_ind[var], const_match)] = self.unmatched_rel1[(self.gold_ind[var], const_match)]
del self.unmatched_rel1[(self.gold_ind[var], const_match)]
else:
gold_edge_lbl = reln
self.unmatched_rel1[(self.gold_ind[var], const_match)].remove(reln)
break
self.add_node(node_hash, const, gold_node_lbl)
self.add_edge(var, node_hash, reln, gold_edge_lbl)
def add_rel2(self, reln, v1, v2):
gold_lbl = ''
if (self.gold_ind[v1], self.gold_ind[v2]) in self.gold_rel2_t:
if reln in self.gold_rel2_t[(self.gold_ind[v1], self.gold_ind[v2])]:
gold_lbl = reln
self.unmatched_rel2[(self.gold_ind[v1], self.gold_ind[v2])].remove(reln)
self.add_edge(v1, v2, reln, gold_lbl)
def unmatch_dead_nodes(self, node_weight_fn, edge_weight_fn):
""" Unmap node mappings that don't increase smatch score. """
node_is_live = {v:(gold == -1) for (v, gold) in self.gold_ind.items()}
for (v, attr) in self.G.nodes(data=True):
if node_weight_fn(attr['test_label'], attr['gold_label']) > 0:
node_is_live[v] = True
    for (v1, links) in self.G.adjacency_iter():  # networkx 1.x API (G.adjacency() in 2.x)
for (v2, edges) in links.items():
if len(edges) > 1:
node_is_live[v2] = True
node_is_live[v1] = True
break
for (ind, attr) in edges.items():
if attr['test_label'] == attr['gold_label']:
node_is_live[v2] = True
node_is_live[v1] = True
break
for v in node_is_live.keys():
if not node_is_live[v]:
self.unmatched_inst[self.gold_ind[v]] = self.G.node[v]['gold_label']
self.G.node[v]['gold_label'] = ''
self.G.node[v]['label'] = u'%s / *' % self.G.node[v]['test_label']
self.G.node[v]['color'] = TEST_COLOR
del self.gold_ind[v]
def amr2dict(inst, rel1, rel2):
""" Get tables of AMR data indexed by variable number """
node_inds = {}
inst_t = {}
for (ind, (i, v, label)) in enumerate(inst):
node_inds[v] = ind
inst_t[ind] = label
rel1_t = {}
for (label, v1, const) in rel1:
if (node_inds[v1], const) not in rel1_t:
rel1_t[(node_inds[v1], const)] = set()
rel1_t[(node_inds[v1], const)].add(label)
rel2_t = {}
for (label, v1, v2) in rel2:
if (node_inds[v1], node_inds[v2]) not in rel2_t:
rel2_t[(node_inds[v1], node_inds[v2])] = set()
rel2_t[(node_inds[v1], node_inds[v2])].add(label)
return (inst_t, rel1_t, rel2_t)
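On toy triples (hypothetical, shaped like smatch's `get_triples2()` output), the indexing scheme of `amr2dict` looks like this; note the real function accumulates labels into sets per key, which this compressed sketch only mirrors for single-label keys:

```python
# Toy run of the indexing scheme in amr2dict: variables are numbered by
# position in the instance list, and relation tables are keyed on those
# indices (hypothetical triples, not real smatch output).
inst = [("instance", "v0", "want-01"), ("instance", "v1", "boy")]
rel1 = [("polarity", "v0", "-")]
rel2 = [("ARG0", "v0", "v1")]

node_inds = {v: ind for (ind, (_, v, _)) in enumerate(inst)}
inst_t = {ind: label for (ind, (_, _, label)) in enumerate(inst)}
rel1_t = {(node_inds[v1], const): {label} for (label, v1, const) in rel1}
rel2_t = {(node_inds[v1], node_inds[v2]): {label} for (label, v1, v2) in rel2}

assert inst_t == {0: "want-01", 1: "boy"}
assert rel1_t == {(0, "-"): {"polarity"}}
assert rel2_t == {(0, 1): {"ARG0"}}
```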