panx27/wikidata-dump-processor

wikidata-dump-processor

Import a Wikidata JSON dump (.json.bz2) into MongoDB and create indexes.

  • Index:

    Wikidata ID: { id: 1 }

    English Alias: { aliases.en.value: 1 }

    English Wikipedia Title: { sitelinks.enwiki.title: 1 }

    Freebase ID: { claims.P646.mainsnak.datavalue.value: 1 }

    subclass of: { claims.P279.mainsnak.datavalue.value.id: 1 }

    instance of: { claims.P31.mainsnak.datavalue.value.id: 1 }

    all properties: { properties: 1 }

  • Partial Index for Covered Query:

    { sitelinks.enwiki.title: 1, id: 1 }

    { labels.en.value: 1, id: 1 }

  • Performance: about 3 hours to import and about 1 hour to index, using --nworker 12 and --chunk_size 10000 on the 20180717 dump (25 GB)
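
The two compound indexes listed for covered queries can answer a lookup from the index alone, provided the query filters on the first field and returns only indexed fields. The following is a minimal sketch of such a query; the helper names are hypothetical and are not part of this repository.

```python
def covered_title_query(title):
    """Build (filter, projection) to look up a Wikidata ID by enwiki title.

    Hypothetical helper, intended to match the index
    { sitelinks.enwiki.title: 1, id: 1 }.
    """
    query = {"sitelinks.enwiki.title": title}
    # _id is not part of the index, so it must be excluded explicitly;
    # otherwise MongoDB has to fetch the document to return it.
    projection = {"_id": 0, "sitelinks.enwiki.title": 1, "id": 1}
    return query, projection


def covered_label_query(label):
    """Build (filter, projection) to look up a Wikidata ID by English label.

    Hypothetical helper, intended to match the index
    { labels.en.value: 1, id: 1 }.
    """
    query = {"labels.en.value": label}
    projection = {"_id": 0, "labels.en.value": 1, "id": 1}
    return query, projection
```

With pymongo, the pair would typically be passed as collection.find(query, projection), where collection is the collection populated in Step 1.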

Quickstart

Step 1: import

usage: import.py [-h] [--chunk_size CHUNK_SIZE] [--nworker NWORKER]
                 inpath host port db_name collection_name

positional arguments:
  inpath                Path to inpath file (xxxxxxxx-all.json.bz2)
  host                  MongoDB host
  port                  MongoDB port
  db_name               Database name
  collection_name       Collection name

optional arguments:
  --chunk_size CHUNK_SIZE, -c CHUNK_SIZE
                        Chunk size (default=10000, RAM usage depends on chunk
                        size)
  --nworker NWORKER, -n NWORKER
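
To illustrate why RAM usage depends on the chunk size, here is a minimal sketch of reading a dump in chunks. It is not the code in import.py; it assumes the standard Wikidata dump layout (one JSON array, one entity per line, each line ending with a comma), and the function names are hypothetical.

```python
import bz2
import json
from itertools import islice


def iter_entities(path):
    """Yield one entity dict per line of a .json.bz2 Wikidata dump."""
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            # Drop surrounding whitespace and the trailing comma
            # that separates array elements.
            line = line.strip().rstrip(",")
            # Skip the array brackets on the first and last lines.
            if line in ("[", "]", ""):
                continue
            yield json.loads(line)


def iter_chunks(path, chunk_size=10000):
    """Yield lists of at most chunk_size entities.

    Only one chunk is held in memory at a time, so a larger
    chunk_size means higher peak RAM usage.
    """
    entities = iter_entities(path)
    while True:
        chunk = list(islice(entities, chunk_size))
        if not chunk:
            return
        yield chunk
```

Each chunk could then be written in one call, for example with pymongo's collection.insert_many(chunk).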

Step 2: index

usage: index.py [-h] host port db_name collection_name

positional arguments:
  host             MongoDB host
  port             MongoDB port
  db_name          Database name
  collection_name  Collection name
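
The indexes listed at the top of this README can be written as pymongo key lists. This is a sketch of how they might be created, not the contents of index.py; INDEX_SPECS and create_indexes are hypothetical names, and collection is assumed to be any object with a pymongo-style create_index(keys) method.

```python
# Each entry is a list of (field, direction) pairs; 1 means ascending.
INDEX_SPECS = [
    [("id", 1)],                                        # Wikidata ID
    [("aliases.en.value", 1)],                          # English alias
    [("sitelinks.enwiki.title", 1)],                    # English Wikipedia title
    [("claims.P646.mainsnak.datavalue.value", 1)],      # Freebase ID
    [("claims.P279.mainsnak.datavalue.value.id", 1)],   # subclass of
    [("claims.P31.mainsnak.datavalue.value.id", 1)],    # instance of
    [("properties", 1)],                                # all properties
    [("sitelinks.enwiki.title", 1), ("id", 1)],         # covered query
    [("labels.en.value", 1), ("id", 1)],                # covered query
]


def create_indexes(collection, specs=INDEX_SPECS):
    """Create every index in specs and return the names the driver reports."""
    return [collection.create_index(keys) for keys in specs]
```

With pymongo installed, the collection would typically be obtained as MongoClient(host, port)[db_name][collection_name], using the same four values passed to index.py.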

Miscellaneous

  • If you get an errno:24 Too many open files error, try increasing the system limits. For example, on Linux, you can run ulimit -n 64000 in the console that runs mongod.
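
Before raising the limit, it may help to see what it currently is. This sketch uses Python's standard resource module (Unix only) and reports the limits of the Python process itself. It only approximates what mongod sees when run from the same shell, since mongod's own limit is what matters.

```python
import resource


def open_file_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = open_file_limits()
print(f"open files: soft={soft} hard={hard}")
```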
