FreshWiki

FreshWiki Dataset

The FreshWiki Dataset is a collection of high-quality Wikipedia articles focusing on the most-edited pages from February 2022 to September 2023. See Section 2.1 in our paper for more details.

This dataset could be valuable for researchers working on tasks like report generation, knowledge curation, information retrieval, etc. The text data in this dataset is licensed under the Creative Commons Attribution-ShareAlike (CC BY-SA) license. Please refer to the Wikipedia reuse guidelines for details: https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content

To ease data contamination issue, we provide the source code for the data construction pipeline that can be repeated at future dates. The code can be found in the following files:

get_fresh_wiki_page.py
wikipage_extractor.py

Please cite our paper if you found the dataset or data construction pipeline useful for your research.

@inproceedings{shao2024assisting,
      title={{Assisting in Writing Wikipedia-like Articles From Scratch with Large Language Models}}, 
      author={Yijia Shao and Yucheng Jiang and Theodore A. Kanell and Peter Xu and Omar Khattab and Monica S. Lam},
      year={2024},
      booktitle={Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)}
}

Name		Name	Last commit message	Last commit date
parent directory ..
json		json
txt		txt
LICENSE		LICENSE
README.md		README.md
get_fresh_wiki_page.py		get_fresh_wiki_page.py
topic_list.csv		topic_list.csv
wikipage_extractor.py		wikipage_extractor.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

FreshWiki

FreshWiki

README.md

FreshWiki Dataset

Files

FreshWiki

Directory actions

More options

Directory actions

More options

Latest commit

History

FreshWiki

Folders and files

parent directory

README.md

FreshWiki Dataset