Commit 833bce36 authored by Aurélien Lamercerie

Initial commit

ssc-env/*
# Solar System Corpus
-------------------------------------------------------------------------------
This repository gathers experimental data about the solar system,
and some useful scripts/programs to obtain this data.
## Data
The data is organized into the following folders:
- abstractText: raw text data from [DBPedia](https://dbpedia.org/)
- amrGraph: sentence representations as AMR graphs
- amrLk: sentence representations as AMR Linked Data
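As a rough illustration of how the amrGraph files can be consumed, the sketch below splits the content of such a file into (sentence, graph) pairs. It assumes the layout seen in this repository's sample output: each entry starts with a `# ::snt` line and entries are separated by a blank line. The helper name `read_amr_blocks` is hypothetical and not part of the repository.

```python
from typing import List, Tuple


def read_amr_blocks(text: str) -> List[Tuple[str, str]]:
    """Split the content of an amrGraph file into (sentence, graph) pairs.

    Assumption: entries are separated by blank lines and carry the
    sentence in a '# ::snt' metadata line.
    """
    pairs = []
    for block in text.strip().split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue  # skip empty blocks left by extra blank lines
        sentence = ""
        graph_lines = []
        for line in lines:
            if line.startswith("# ::snt"):
                sentence = line[len("# ::snt"):].strip()
            elif not line.startswith("#"):
                graph_lines.append(line)  # keep only the graph itself
        pairs.append((sentence, "\n".join(graph_lines)))
    return pairs


sample = (
    "# ::snt The sun is a star.\n"
    "(s / star\n"
    "      :domain (s2 / sun))\n"
    "\n"
    "# ::snt Earth is a planet.\n"
    "(p / planet\n"
    "      :domain p)\n"
)
pairs = read_amr_blocks(sample)
for sentence, graph in pairs:
    print(sentence)
    print(graph)
```

For real use, a dedicated AMR reader is likely a better choice than this string-based split, which only serves to show the file layout.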
## Script: <convert_text_to_amr.py>
This script converts raw text into AMR representations. It can be
adapted as needed; in particular, the parameters in the script can be adjusted
to specify the data to be processed.
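The text preparation step can be tried without loading any AMR model. The sketch below is a simplified variant of the script's splitting logic, not the exact code: it splits each line on ". ", drops empty fragments and language marks such as "(en)" found at the end of DBPedia abstracts, and restores the final period. The function name `split_into_sentences` is hypothetical.

```python
import re
from typing import List

# Matches a language mark such as "(en)" at the start of a fragment
LANG_MARK_RE = re.compile(r"\([a-z]+\).*")


def split_into_sentences(text: str) -> List[str]:
    """Naive sentence splitter: one sentence per '. '-separated fragment."""
    sentences = []
    for line in text.splitlines():
        for part in line.split(". "):
            part = part.strip().rstrip(".")  # drop trailing periods and spaces
            if not part or LANG_MARK_RE.match(part):
                continue  # ignore empty fragments and language marks
            sentences.append(part + ".")
    return sentences


print(split_into_sentences("The sun is a star. Earth is a planet.\n(en)\n"))
```

Splitting on ". " is a deliberate simplification: abbreviations followed by a space would be cut in the wrong place, so a proper sentence tokenizer may be preferable for larger texts.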
### Installation
This project was developed with Python 3 on Manjaro Linux, but it should
run on any common system.
First, it is recommended to use a
[virtual environment](https://docs.python.org/fr/3/tutorial/venv.html).
For example, an environment named 'ssc-env' can be created and activated with the following commands:
python3 -m venv ssc-env
source ssc-env/bin/activate
The required libraries are listed in the file 'requirements.txt'.
They can be installed in the virtual environment with a package installer such
as pip:
pip install -r requirements.txt
amrlib has its own installation instructions
([amrlib-install](https://amrlib.readthedocs.io/en/latest/install/)).
In particular, the models used by the amrlib library must be installed separately.
Models can be downloaded from
[amrlib-models](https://github.com/bjascob/amrlib-models).
These files need to be extracted into the amrlib/data directory of the
installation; the default parse model should be named model_stog.
The default model is loaded with `stog = amrlib.load_stog_model()`.
To have multiple models of the same type, supply the model directory
when loading, e.g.
`stog = amrlib.load_stog_model(model_dir='.../amrlib/data/model_parse_t5-v0_1_0')`.
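A missing or misnamed model directory is an easy mistake to make at this step, so it can help to check the path before loading. The following is a minimal sketch under that assumption; `resolve_model_dir` and `load_parser` are hypothetical helpers, while `amrlib.load_stog_model(model_dir=...)` and `parse_sents` are the amrlib calls already used in this repository.

```python
from pathlib import Path


def resolve_model_dir(data_dir: str, model_name: str = "model_stog") -> Path:
    """Return the model directory, failing early if it does not exist."""
    model_dir = Path(data_dir) / model_name
    if not model_dir.is_dir():
        raise FileNotFoundError(f"amrlib model directory not found: {model_dir}")
    return model_dir


def load_parser(data_dir: str, model_name: str = "model_stog"):
    """Load a sentence-to-graph model from an explicit directory."""
    # Imported lazily so the path check above works without amrlib installed
    import amrlib
    model_dir = resolve_model_dir(data_dir, model_name)
    return amrlib.load_stog_model(model_dir=str(model_dir))


# Example (requires amrlib and a downloaded model; the path is elided on purpose):
# stog = load_parser(".../amrlib/data", "model_parse_xfm_bart_large-v0_1_0")
# graphs = stog.parse_sents(["The sun is a star."])
```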
### Usage
Parameters can be adjusted in the script as needed. The script takes a data
reference as its argument (for example, 'test'). It can be run from the command line:
python3 convert_text_to_amr.py test
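The data reference is simply the name of a text file without its extension: the script reads abstractText/<ref>.txt and writes to amrGraph/<ref>.txt. The sketch below shows that mapping together with an argparse-based command line, which is an assumption for illustration (the script itself reads `sys.argv[1]` directly). The names `paths_for` and `parse_args` are hypothetical.

```python
import argparse
from pathlib import Path

RAW_DIR = Path("abstractText")
AMR_GRAPH_DIR = Path("amrGraph")


def paths_for(data_ref: str):
    """Map a data reference to its input text file and output graph file."""
    return RAW_DIR / f"{data_ref}.txt", AMR_GRAPH_DIR / f"{data_ref}.txt"


def parse_args(argv=None):
    """Parse the command line; a missing data reference yields a usage message."""
    parser = argparse.ArgumentParser(description="Convert raw text to AMR graphs")
    parser.add_argument(
        "data_ref",
        help="name of the text file in abstractText/, without the .txt extension",
    )
    return parser.parse_args(argv)


args = parse_args(["test"])
source, target = paths_for(args.data_ref)
print(source, "->", target)
```

Note that the script opens the output file in append mode, so running it twice on the same data reference adds the graphs to the file a second time.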
# References
-------------------------------------------------------------------------------
[amrlib](https://github.com/bjascob/amrlib):
A Python library that makes AMR parsing, generation and visualization simple.
[amr-ld](https://github.com/BMKEG/amr-ld/):
A Python library for mapping AMRs to linked data formats (such as RDF and JSON-LD).
Burns, G.A., Hermjakob, U., Ambite, J.L. (2016).
Abstract Meaning Representations as Linked Data.
In: The Semantic Web – ISWC 2016.
Lecture Notes in Computer Science, vol 9982. Springer, Cham.
https://doi.org/10.1007/978-3-319-46547-0_2
[DBPedia](https://www.dbpedia.org/about/):
A crowd-sourced community effort to extract structured content from the information
created in various Wikimedia projects.
The Solar System is the gravitationally bound system of the Sun and the objects that orbit it, either directly or indirectly. Of the objects that orbit the Sun directly, the largest are the eight planets, with the remainder being smaller objects, the dwarf planets and small Solar System bodies. Of the objects that orbit the Sun indirectly—the natural satellites—two are larger than the smallest planet, Mercury, and one more almost equals it in size. The Solar System formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud. The vast majority of the system's mass is in the Sun, with the majority of the remaining mass contained in Jupiter. The four smaller inner system planets, Mercury, Venus, Earth and Mars, are terrestrial planets, being primarily composed of rock and metal. The four outer system planets are giant planets, being substantially more massive than the terrestrials. The two largest planets, Jupiter and Saturn, are gas giants, being composed mainly of hydrogen and helium; the two outermost planets, Uranus and Neptune, are ice giants, being composed mostly of substances with relatively high melting points compared with hydrogen and helium, called volatiles, such as water, ammonia and methane. All eight planets have almost circular orbits that lie within a nearly flat disc called the ecliptic. The Solar System also contains smaller objects. The asteroid belt, which lies between the orbits of Mars and Jupiter, mostly contains objects composed, like the terrestrial planets, of rock and metal. Beyond Neptune's orbit lie the Kuiper belt and scattered disc, which are populations of trans-Neptunian objects composed mostly of ices, and beyond them a newly discovered population of sednoids. Within these populations, some objects are large enough to have rounded under their own gravity, though there is considerable debate as to how many there will prove to be. Such objects are categorized as dwarf planets. 
Astronomers generally accept at least nine objects as dwarf planets: the asteroid Ceres and the trans-Neptunian objects Pluto, Eris, Haumea, Makemake, Gonggong, Quaoar, Sedna, and Orcus. In addition to these two regions, various other small-body populations, including comets, centaurs and interplanetary dust clouds, freely travel between regions. Six of the planets, the six largest possible dwarf planets, and many of the smaller bodies are orbited by natural satellites, usually termed "moons" after the Moon. Each of the outer planets is encircled by planetary rings of dust and other small objects. The solar wind, a stream of charged particles flowing outwards from the Sun, creates a bubble-like region in the interstellar medium known as the heliosphere. The heliopause is the point at which pressure from the solar wind is equal to the opposing pressure of the interstellar medium; it extends out to the edge of the scattered disc. The Oort cloud, which is thought to be the source for long-period comets, may also exist at a distance roughly a thousand times further than the heliosphere. The Solar System is located 26,000 light-years from the center of the Milky Way galaxy in the Orion Arm, which contains most of the visible stars in the night sky. The nearest stars are within the so-called Local Bubble, with the closest, Proxima Centauri, at 4.25 light-years. (en)
The sun is a star. Earth is a planet.
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet.
.
(p / planet
:domain p
:name (n / name
:op1 "Earth"))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet.
.
(p / planet
:domain p
:name (n / name
:op1 "Earth"))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet.
.
(p / planet
:domain p
:name (n / name
:op1 "Earth"))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet.
.
(p / planet
:domain p
:name (n / name
:op1 "Earth"))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet..
(p / planet
:domain (p2 / planet
:name (n / name
:op1 "Earth")))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet..
(p / planet
:domain (p2 / planet
:name (n / name
:op1 "Earth")))
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt .
(a / amr-empty)
# ::snt The sun is a star.
(s / star
:domain (s2 / sun))
# ::snt Earth is a planet.
(p / planet
:domain p
:name (n / name
:op1 "Earth"))
#!/usr/bin/python3.10
# -*-coding:Utf-8 -*
#==============================================================================
# Solar System Corpus: convert text to amr
#------------------------------------------------------------------------------
# Script to convert raw text into AMRs Graph
#==============================================================================
#==============================================================================
# Importing required modules
#==============================================================================
import amrlib
import re
import sys
#==============================================================================
# Parameters
#==============================================================================
# Useful directories
RAW_DIR = "abstractText/"
AMR_GRAPH_DIR = "amrGraph/"
AMR_LK_DIR = "amrLk/"
# AMR Lib Models
AMR_LIB_DATA = '/home/lamenji/.local/lib/python3.10/site-packages/amrlib/data/'
AMR_MODEL_XFM_LARGE = AMR_LIB_DATA + 'model_parse_xfm_bart_large-v0_1_0'
amr_model = AMR_MODEL_XFM_LARGE
#==============================================================================
# Utilities
#==============================================================================
def is_valid_sentence(sentence):
    """ True if the sentence is neither empty nor a language mark such as "(en)".
    """
    is_empty = sentence in ("", "\n")
    lang_mark_re = re.compile(r"\([a-z]+\)(.)*")  # raw string avoids invalid escapes
    is_language_mark = lang_mark_re.match(sentence) is not None
    return not (is_empty or is_language_mark)
def clean_sentence(sentence):
    """ Sentence cleanup as needed """
    # Remove any trailing periods together with the end-of-line character
    sentence = re.sub(r"(\.)*\n", "", sentence)
    return sentence
#==============================================================================
# Main Functions
#==============================================================================
def read_input_file(filename):
    print("-- Reading input files (" +
          filename +
          ") to recover a list of sentences")
    sentences_list = list()
    input_file = RAW_DIR + filename + ".txt"
    with open(input_file, "r") as reading_file: # r = read
        for line in reading_file.readlines():
            sentences = line.split(". ")
            for sentence in sentences:
                if is_valid_sentence(sentence):
                    sentence = clean_sentence(sentence)
                    sentences_list.append(sentence + ".")
    return sentences_list
def convert_sentences_to_graph(model, sentences):
    print("-- Converting text sentences to AMR graphs")
    stog = amrlib.load_stog_model(model_dir=model)
    graphs = []
    i = 1
    for sentence in sentences:
        print("--- parse sentence " + str(i) + ": " + sentence)
        graphs.extend(stog.parse_sents([sentence]))
        i += 1
    return graphs
def write_output_file(graphs, filename):
    print("-- Writing AMR graphs to output file (" +
          filename +
          ")")
    output_file = AMR_GRAPH_DIR + filename + ".txt"
    with open(output_file, "a") as writing_file: # a = append
        for graph in graphs:
            out_graph = graph + "\n\n"
            print(out_graph)
            writing_file.write(out_graph)
# Data
test = "test"
ssc_00_1 = "solar-system"
data_name = test
#==============================================================================
# Main function
#==============================================================================
def main(data_ref):
    # -- Prepare the sentences to be converted
    print("\n" + "[SSC] Data Preparation")
    source_sentences = read_input_file(data_ref)
    print(source_sentences)
    print("-- number of sentences: " + str(len(source_sentences)))
    # -- Convert sentences to graphs
    print("\n" + "[SSC] Convert sentences to graphs")
    print("-- library: amrlib")
    print("-- model: " + amr_model)
    graphs = convert_sentences_to_graph(amr_model, source_sentences)
    write_output_file(graphs, data_ref)
if __name__ == "__main__":
    main(sys.argv[1])