The extraction protocol can be divided into two steps:
- Parsing the Wikidata dump to build our Wikidata graph via igraph.
- Loading our igraph representation of Wikidata and generating the subgraph dataset.

All subgraph extraction code can be found in `subgraphs_dataset_creation/`.
Wikidata frequently releases and updates its dumps, which are available in various formats: JSON, RDF, XML, etc. To use our code, first download the dump in JSON format. Then, to parse the Wikidata JSON dump, run:
python3 subgraphs_dataset_creation/dump_repacker.py --data_path /path/to/downloaded_dump --save_path /path/to/parse_dump
where the arguments:
- `--data_path` refers to the path where the JSON dump is stored.
- `--save_path` refers to the path where we want to save the igraph triples representation.
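For example, assuming the JSON dump was downloaded to `data/latest-all.json.gz` and we want the parsed triples in `data/parsed_dump` (both paths are hypothetical and should be adjusted to your setup), the call could look like:

```bash
python3 subgraphs_dataset_creation/dump_repacker.py \
    --data_path data/latest-all.json.gz \
    --save_path data/parsed_dump
```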
After running the above script, a `wikidata_triples.txt` file will be created within the save path given above. This triples text file is ready to be loaded via igraph:
from igraph import Graph

# graph_path is where we stored wikidata_triples.txt
igraph_wikidata = Graph.Read_Ncol(
    graph_path, names=True, directed=True, weights=True
)
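As a quick sanity check, the loaded graph can be queried by Wikidata ID; a minimal sketch continuing the snippet above (the QID `Q42` is only an example):

```python
# Find a vertex by its Wikidata QID and list a few of its outgoing neighbours.
vertex = igraph_wikidata.vs.find(name="Q42")
neighbour_ids = igraph_wikidata.neighbors(vertex, mode="out")
print([igraph_wikidata.vs[idx]["name"] for idx in neighbour_ids[:10]])
```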
Since parsing this Wikidata dump takes a long time, checkpoints were implemented. If the process crashes for any reason, simply rerun `subgraphs_dataset_creation/dump_repacker.py`; the code will automatically continue parsing from where the crash happened.
After we have parsed the Wikidata dump and obtained our igraph triples representation, we are ready to generate the subgraph dataset. First, we need to pre-process Mintaka (fetch the label of each Wikidata question entity and prepare the answer candidates produced by our LLMs; all collected in a single, accessible `jsonl` file). To do so, please run the Jupyter notebook `subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb`. The inputs of this notebook are the answer candidates generated by our LLMs (`.csv` and `.json` formats for T5-like models and Mixtral/Mistral, respectively).
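If you prefer to run the notebook non-interactively, one option (assuming `jupyter` and `nbconvert` are installed) is:

```bash
jupyter nbconvert --to notebook --execute --inplace \
    subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb
```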
Finally, to fetch the desired subgraphs, run:
python3 subgraphs_dataset_creation/mining_subgraphs_dataset_processes.py
which has the following arguments (an example invocation is shown after the list):
- `--save_jsonl_path` indicates the path of the final resulting `jsonl` file (with our subgraphs).
- `--igraph_wikidata_path` indicates the path of the file with our igraph triples representation.
- `--subgraphs_dataset_prepared_entities_jsonl_path` indicates the path of the pre-processed Mintaka dataset, the output of `subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb`.
- `--n_jobs` indicates the number of processes used by our multi-processing scheme. ATTENTION: each process requires ~60-80 GB of RAM.
- `--skip_lines` indicates the number of lines to skip in the prepared entities `jsonl` file (from `--subgraphs_dataset_prepared_entities_jsonl_path`).
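For example, an invocation with hypothetical paths (adjust them to your setup; with `--n_jobs 2`, expect roughly 120-160 GB of RAM to be used):

```bash
python3 subgraphs_dataset_creation/mining_subgraphs_dataset_processes.py \
    --save_jsonl_path data/subgraphs_dataset.jsonl \
    --igraph_wikidata_path data/parsed_dump/wikidata_triples.txt \
    --subgraphs_dataset_prepared_entities_jsonl_path data/prepared_entities.jsonl \
    --n_jobs 2 \
    --skip_lines 0
```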
After running the above script, the final data will be a `jsonl` file at the path given by `--save_jsonl_path`. Each entry in this `.jsonl` file represents one question-answer pair and its corresponding subgraph. One sample entry can be seen below:
{"id":"fae46b21","question":"What man was a famous American author and also a steamboat pilot on the Mississippi River?","answerEntity":["Q893594"],"questionEntity":["Q1497","Q846570"],"groundTruthAnswerEntity":["Q7245"],"complexityType":"intersection","graph":{"directed":true,"multigraph":false,"graph":{},"nodes":[{"type":"INTERNAL","name_":"Q30","id":0},{"type":"QUESTIONS_ENTITY","name_":"Q1497","id":1},{"type":"QUESTIONS_ENTITY","name_":"Q846570","id":2},{"type":"ANSWER_CANDIDATE_ENTITY","name_":"Q893594","id":3}],"links":[{"name_":"P17","source":0,"target":0},{"name_":"P17","source":1,"target":0},{"name_":"P17","source":2,"target":0},{"name_":"P527","source":2,"target":3},{"name_":"P17","source":3,"target":0},{"name_":"P279","source":3,"target":2}]}}