The extraction protocol can be divided into two steps:
- Parsing the Wikidata dump to build our Wikidata graph via igraph.
- Loading our igraph representation of Wikidata and generating the subgraph dataset.

All subgraph extraction code can be found in `subgraphs_dataset_creation/`.
Wikidata frequently releases and updates its dumps, which are available in various formats: JSON, RDF, XML, etc. To use our code, first download the dump in JSON format. Then, to parse the Wikidata JSON dump, run:
python3 subgraphs_dataset_creation/dump_repacker.py --data_path /path/to/downloaded_dump --save_path /path/to/parse_dump
where the arguments:
- `--data_path` refers to the path where the JSON dump is stored.
- `--save_path` refers to the path where we want to save the igraph triples representation.
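For example, assuming the JSON dump was downloaded to `data/latest-all.json.gz` and we want the parsed triples in `data/parsed_dump` (both paths are hypothetical and should be adjusted to your setup), the call could look like:

```bash
python3 subgraphs_dataset_creation/dump_repacker.py \
    --data_path data/latest-all.json.gz \
    --save_path data/parsed_dump
```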
After running the above script, a `wikidata_triples.txt` file will be created within the save path given above. This triples text file is ready to be loaded via igraph:
from igraph import Graph

# graph_path is where we stored wikidata_triples.txt
igraph_wikidata = Graph.Read_Ncol(
    graph_path, names=True, directed=True, weights=True
)
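As a quick sanity check, the loaded graph can be queried by Wikidata ID; a minimal sketch continuing the snippet above (the QID `Q42` is only an example):

```python
# Find a vertex by its Wikidata QID and list a few of its outgoing neighbours.
vertex = igraph_wikidata.vs.find(name="Q42")
neighbour_ids = igraph_wikidata.neighbors(vertex, mode="out")
print([igraph_wikidata.vs[idx]["name"] for idx in neighbour_ids[:10]])
```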
Since parsing this Wikidata dump takes a long time, checkpoints were implemented. If the process crashes for any reason, simply rerun `subgraphs_dataset_creation/dump_repacker.py`; the code will automatically continue parsing from where the crash happened.
After we have parsed the Wikidata dump and obtained our igraph triples representation, we are ready to generate the subgraph dataset. First, we need to pre-process Mintaka (fetch the label of each Wikidata question entity and prepare the answer candidates produced by our LLMs; all collected in a single, accessible `jsonl` file). To do so, please run the Jupyter notebook `subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb`. The inputs of this notebook are the answer candidates generated by our LLMs (`.csv` and `.json` formats for T5-like models and Mixtral/Mistral, respectively).
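If you prefer to run the notebook non-interactively, one option (assuming `jupyter` and `nbconvert` are installed) is:

```bash
jupyter nbconvert --to notebook --execute --inplace \
    subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb
```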
Finally, to fetch the desired subgraphs, run:
python3 subgraphs_dataset_creation/mining_subgraphs_dataset_processes.py
which has the following arguments (an example invocation is shown after the list):
- `--save_jsonl_path` indicates the path of the final resulting `jsonl` file (with our subgraphs).
- `--igraph_wikidata_path` indicates the path of the file with our igraph triples representation.
- `--subgraphs_dataset_prepared_entities_jsonl_path` indicates the path of the pre-processed Mintaka dataset, the output of `subgraphs_dataset_creation/mintaka_subgraphs_preparing.ipynb`.
- `--n_jobs` indicates the number of processes used by our multi-processing scheme. ATTENTION: each process requires ~60-80 GB of RAM.
- `--skip_lines` indicates the number of lines to skip in the prepared entities `jsonl` file (from `--subgraphs_dataset_prepared_entities_jsonl_path`).
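For example, an invocation with hypothetical paths (adjust them to your setup; with `--n_jobs 2`, expect roughly 120-160 GB of RAM to be used):

```bash
python3 subgraphs_dataset_creation/mining_subgraphs_dataset_processes.py \
    --save_jsonl_path data/subgraphs_dataset.jsonl \
    --igraph_wikidata_path data/parsed_dump/wikidata_triples.txt \
    --subgraphs_dataset_prepared_entities_jsonl_path data/prepared_entities.jsonl \
    --n_jobs 2 \
    --skip_lines 0
```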
After running the above script, the final data will be a `jsonl` file at the path given by `--save_jsonl_path`. Each entry in this `.jsonl` file represents one question-answer pair and its corresponding subgraph. One sample entry can be seen below:
{"id":"fae46b21","question":"What man was a famous American author and also a steamboat pilot on the Mississippi River?","answerEntity":["Q893594"],"questionEntity":["Q1497","Q846570"],"groundTruthAnswerEntity":["Q7245"],"complexityType":"intersection","graph":{"directed":true,"multigraph":false,"graph":{},"nodes":[{"type":"INTERNAL","name_":"Q30","id":0},{"type":"QUESTIONS_ENTITY","name_":"Q1497","id":1},{"type":"QUESTIONS_ENTITY","name_":"Q846570","id":2},{"type":"ANSWER_CANDIDATE_ENTITY","name_":"Q893594","id":3}],"links":[{"name_":"P17","source":0,"target":0},{"name_":"P17","source":1,"target":0},{"name_":"P17","source":2,"target":0},{"name_":"P527","source":2,"target":3},{"name_":"P17","source":3,"target":0},{"name_":"P279","source":3,"target":2}]}}