The analysis of subgraphs in Wikidata showed how large each subgraph is and how connected the subgraphs are. This page presents the results of analyzing the queries that relate to these subgraphs. The questions to be answered were:
What percentage of queries access each subgraph?
How many queries access multiple subgraphs at once, i.e., how much overlap can we expect between subgraphs?
How long do these queries take?
How many user-agents access each subgraph? Do they access many subgraphs, or are they confined to a small set of subgraphs? Do some of them dominate queries in multiple subgraphs?
Are there chunks of similar queries in these subgraphs, i.e., how diverse are the queries in each subgraph?
Shorter TL;DR
The monthly queries that touch the top 341 subgraphs were analyzed. The percentage of queries for each subgraph changes slightly from month to month. Only two subgraphs have significantly more queries than the others: the human subgraph has ~30% of the queries, and the taxon subgraph has ~20% of the queries.
Query times are divided into 5 classes. Most query times are low (10ms to 100ms, the second class of query time), some are 100ms to 1s (the third class of query time).
Most subgraphs don’t have a lot of user-agents accessing them. Some of them have a few user-agents doing most of the queries. The trend of a dominating single source of queries is not widespread among subgraphs.
Most user agents (89% of them) query only one subgraph. A few user agents query a large number of subgraphs.
70% of all queries touch on 1 or 2 subgraphs. 64% of all queries touch on only 1 subgraph.
No significant correlation was found between query time and the number of subgraphs a query tries to access.
For human subgraph:
The queries that were estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the rest 6.16% of queries used a mix of human and various other subgraphs.
The total query time of the human subgraph is 34% of the total query time. The average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
More than 50% of the queries in the human subgraph are made by ~10 user agents, so a small number of user-agents make most of the queries.
The top 10 types of queries account for 60% of the queries in the human subgraph, so a small number of query types covers the bulk of the queries.
Some user agents make a moderate number of small, simple queries of various types, but most user agents make only 1 type of query.
For taxon subgraph:
The queries that were estimated to be related to the taxon subgraph accounted for 14.26% of all queries in Wikidata. 13.57% of queries used only the taxon subgraph, and the remaining 0.69% of queries used a mix of taxon and various other subgraphs.
Q-ID (Q16521) match is almost the only reason for queries to be in the taxon subgraph.
Most properties that match the taxon subgraph are some sort of external IDs.
The total query time of taxon subgraph is ~3% of total query time. The average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
The top user agent performs 85% of taxon subgraph queries. Basically, a single user agent does most of the queries.
The variety of query types in the taxon subgraph is much smaller (1.1K types) than in the human subgraph (11K types). The top 3 types of queries alone account for ~90% of the queries in the taxon subgraph.
Most user agents make only 1 type of query.
In sum, most queries in the taxon subgraph are small and simple, take between 10ms and 100ms to run, and are mostly made by 1 user agent.
TL;DR
The monthly queries that touch/access each of the top 341 subgraphs were determined. The percentage of queries for each subgraph changes slightly from month to month. The list of the top 50 subgraphs for Nov 2021 can be found in #Query count and time, and a comparison of Oct and Nov 2021 data is shown in #Comparison of subgraph queries across time.
Only two subgraphs have significantly more queries than the others: the human subgraph has ~30% of the queries, and the taxon subgraph has ~20% of the queries.
Most subgraphs have most of their query times in the range of 10ms to 100ms. The second most common class is 100ms to 1s. A small number of subgraphs have more time-consuming queries (1s to 10s).
Most subgraphs don’t have a lot of user-agents accessing them. Some of them have a few user-agents doing most of the queries. ~30 out of 341 subgraphs have a user agent that makes >=50% of all queries to that particular subgraph. 6 subgraphs have a user agent making 80-90% of the queries. So the trend of a dominating single source of queries is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
Most user agents (89% of them) query one subgraph only. Some user agents query a lot of subgraphs as well. Explicit observations about some user-agents are shown in #User agent vs Subgraph section.
Looking at the connection among subgraphs through queries we see: 70% of all queries (97% of queries in 341 subgraphs) touch on 1 or 2 subgraphs. 64% of all queries (90% of queries in 341 subgraphs) touch on only 1 subgraph.
No significant correlation was found between query time and the number of subgraphs a query tries to access. However, queries that access more subgraphs (although fewer in number) do appear slightly more often in the larger query time classes.
In-depth analysis was done on Human and Taxon subgraph queries since they account for the most queries per subgraph.
For human subgraph:
The queries that were estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the rest 6.16% of queries used a mix of human and various other subgraphs.
A lot of the queries are associated with human subgraph due to the properties they use, the instance of items, and URIs in subject or object.
There are some high-usage items, but mostly a long list of low-usage items. For properties, only ~10 properties account for most of the queries in the human subgraph. The matched URIs form a smooth logarithmic pattern, with most URIs being Wikipedia article links in various languages.
The total query time of the human subgraph is 34% of the total query time. The average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
More than 50% of the queries in human subgraph are done by ~10 user agents. The top 2 user agents perform 20% of the human subgraph queries (7% of all queries). The first user agent does small queries that don’t require much time (10-100ms), whereas the second user agent performs queries that take comparatively more time (100ms - 1s). Note that user-agent strings are not directly representative of distinct users. The second top user-agent here, for example, has multiple variations of user-agent strings which were considered different at this time for simplicity.
In terms of query type (type is determined by the operations used in a query): the top 10 types of queries account for 60% of the queries of human subgraph. The rest form a long tail of less-used query types.
Most user agents do only 1 type of query. 8 user agents perform more than 500 types of queries. But looking into these types, it seems most of these types don’t have a lot of queries, and are mostly small simple queries. In short: Some user agents do a moderate amount of small simple queries of various types, and most user agents do only 1 type of query.
For taxon subgraph:
The queries that were estimated to be related to the taxon subgraph accounted for 14.26% of all queries in Wikidata. 13.57% of queries used only the taxon subgraph, and the remaining 0.69% of queries used a mix of taxon and various other subgraphs.
Q-ID match is almost the only reason for queries to be in the taxon subgraph (12% out of 14%). Basically, these are the queries that have Q16521 in them.
Only a few (~3) items match with significant number of queries, forming a logarithmic distribution of item use across queries. The distribution of properties used in taxon subgraph queries is also extremely skewed by only ~10 properties, with most of these properties being some sort of external IDs. Most URIs matched are simply the taxon Q-ID (Q16521). Of the unique URIs, ~30% are various Wikipedia links.
The total query time of taxon subgraph is ~3% of total query time. The average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
The top user agent performs 85% of taxon subgraph queries, and the next 4 user agents combined perform 9% of the queries. The rest of the user-agents perform <1% of the queries in taxon subgraph. In terms of time, the top user agent, which made 85% of the queries, accounts for 33% of time consumed. But two more user agents perform some comparatively heavy queries. With less than 1% of queries each, they account for 26% and 11% of query time respectively.
The variety of query types in the taxon subgraph is much smaller (1.1K types) than in the human subgraph (11K types). The top 3 types of queries alone account for ~90% of the queries in the taxon subgraph. The rest form a long tail of small query counts.
Most user agents make only 1 type of query. Only 3 user agents make queries of >100 types, and 5 user agents make queries of 50-100 types. The top user agents (in terms of query time and count) mostly make <5 types of queries. The exception is the two user agents that accounted for comparatively more query time: they perform 25-35 types of queries, which is still quite a small number.
In sum, most queries in the taxon subgraph are small and simple, take between 10ms and 100ms to run, and are mostly made by 1 or 2 user agents.
What are subgraph related queries
We define some parameters to identify whether a query touches on a subgraph based on the items and properties a query uses. Some queries may even touch multiple subgraphs. See more on what a subgraph means here. Note: Subgraphs have overlaps.
The parameters that define which subgraph a query belongs to are:
The query uses the subgraph's Q-ID. Example: queries containing Q5 are part of the Q5 subgraph.
The query uses items that are instances of a particular subgraph.
The query uses items that occur in a particular subgraph 99% of the time.
The query uses properties that occur in a particular subgraph 99% of the time.
The query uses literals that occur in a particular subgraph 99% of the time. The literals can occur with or without language tags, and both versions are compared to check for a match. Note that only whole literals are matched between queries and Wikidata. Queries that ask for partial matches, using regex for example, are not included. The assumption is that such queries are likely to contain other items from the subgraph and are caught anyway.
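The matching criteria above can be sketched roughly as follows. This is a minimal illustration, not the code used for the analysis: the `SubgraphProfile` structure, the function names, and the sample items, property, and literal are all hypothetical, and it assumes the terms of each query (Q-IDs, properties, literals) have already been extracted.

```python
from dataclasses import dataclass, field


@dataclass
class SubgraphProfile:
    # Hypothetical lookup data for one subgraph (illustrative names only).
    qid: str
    instance_items: set = field(default_factory=set)  # items that are instances of the subgraph
    frequent_items: set = field(default_factory=set)  # items occurring 99% of the time in it
    properties: set = field(default_factory=set)      # properties occurring 99% of the time in it
    literals: set = field(default_factory=set)        # whole literals occurring 99% of the time in it


def strip_language_tag(literal):
    # Turns a tagged literal such as "Douglas Adams"@en into the untagged form;
    # strings without a language tag are returned unchanged.
    head, sep, tag = literal.rpartition('"@')
    if sep and tag.replace("-", "").isalnum():
        return head + '"'
    return literal


def match_reasons(query_terms, profile):
    # Return the criteria by which a query matches the subgraph (empty list if none).
    terms = set(query_terms)
    # Literals are compared both with and without their language tag.
    literal_forms = terms | {strip_language_tag(t) for t in terms}
    checks = [
        ("qid", profile.qid in terms),
        ("instance_items", bool(terms & profile.instance_items)),
        ("items", bool(terms & profile.frequent_items)),
        ("properties", bool(terms & profile.properties)),
        ("literals", bool(literal_forms & profile.literals)),
    ]
    return [name for name, hit in checks if hit]


# Made-up profile for the human subgraph, for illustration only.
human = SubgraphProfile(
    qid="Q5",
    instance_items={"Q42"},
    properties={"P569"},
    literals={'"Douglas Adams"'},
)
reasons = match_reasons({"Q42", "P31", '"Douglas Adams"@en'}, human)
```

A query can match several subgraphs, so in practice the same check would be run against every subgraph profile and all matches kept.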
The following analysis uses Wikidata dump of 20211101 and WDQS public SPARQL queries of 10/2021 unless otherwise stated. All query related numbers below are monthly values.
Query count and time
All queries here refer to queries with status code 200 or 500, i.e., valid queries that either succeeded or timed out.
WDQS receives ~220M queries a month.
Total query time for all queries for a month is ~16,000 hours.
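As a quick sanity check on these totals, the average time per query follows directly from the query count and the total query time. The sketch below uses the approximate monthly totals above and the human and taxon values reported for 11/2021 on this page; the function name is only illustrative.

```python
def avg_seconds_per_query(total_hours, query_count):
    # Convert total hours to seconds, then divide by the number of queries.
    return total_hours * 3600 / query_count


# Approximate monthly totals for all of WDQS.
overall = avg_seconds_per_query(16_000, 220_000_000)

# Values reported on this page for 11/2021.
human = avg_seconds_per_query(5248, 60_868_572)
taxon = avg_seconds_per_query(480, 27_172_995)

print(f"overall ~{overall:.2f}s, human ~{human:.2f}s, taxon ~{taxon:.3f}s")
```

The human and taxon results agree with the reported averages of about 0.31s and 0.064s per query.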
The table below lists the top 50 most queried subgraphs with subgraph size and query time information of 11/2021. A breakdown of what caused the match is also present, which corresponds to the parameters mentioned in #What are subgraph related queries. It also ranks the subgraphs by size, query count, and query time consumed. A more complete list containing 341 subgraphs, that form ~90% of Wikidata triples, is available here: subgraph data for November 2021, and subgraph data for October 2021. The difference between values from October and November is shown in the next table for comparison purposes. In some places, the query count percentages differ slightly.
Top 50 most queried subgraphs in Wikidata with subgraph size information
Subgraph rank by size | Subgraph rank by query count | Subgraph rank by query time | Subgraph | Subgraph label | % of triples | % of entities | Days to recover (4.77M rate) | Query count | % count of all queries | Query time (hr) | % time of all queries | Avg time/query (s) | % query using only this subgraph | % query using this and other subgraphs | % count of query from Qid | % count of query from instance items | % count of query from items | % count of query from properties | % count of query from literals
3 | 1 | 1 | Q5 | human | 7.254 | 10.045 | 204 | 60,868,572 | 31.941 | 5248 | 34.195 | 0.31 | 80.718 | 19.282 | 2.541 | 18.199 | 12.198 | 19.457 | 1.435
5 | 2 | 11 | Q16521 | taxon | 2.885 | 3.5 | 81 | 27,172,995 | 14.259 | 480 | 3.131 | 0.064 | 95.179 | 4.821 | 12.19 | 0.746 | 12.862 | 0.87 | 0.433
34 | 3 | 7 | Q4830453 | business | 0.107 | 0.208 | 3 | 9,228,037 | 4.842 | 554 | 3.607 | 0.216 | 68.097 | 31.903 | 1.646 | 2.95 | 2.24 | 0.001 | 0.177
6 | 4 | 5 | Q101352 | family name | 1.646 | 0.511 | 46 | 5,990,617 | 3.144 | 659 | 4.292 | 0.396 | 20.867 | 79.133 | 0.041 | 3.057 | 2.791 | 0.018 | 0.038
15 | 5 | 2 | Q11424 | film | 0.359 | 0.284 | 10 | 5,067,305 | 2.659 | 1541 | 10.042 | 1.095 | 74.15 | 25.85 | 0.451 | 1.469 | 1.348 | 0.003 | 0.543
1 | 6 | 13 | Q13442814 | scholarly article | 48.935 | 39.815 | 1378 | 4,944,995 | 2.595 | 263 | 1.713 | 0.191 | 76.272 | 23.728 | 0.017 | 1.942 | 1.938 | 0.405 | 0.396
7 | 7 | 3 | Q4167410 | Wikimedia disambiguation page | 1.354 | 1.464 | 38 | 3,292,873 | 1.728 | 765 | 4.982 | 0.836 | 22.922 | 77.078 | 0.164 | 0.192 | 0.472 | 0.0 | 1.163
2 | 8 | 25 | Q6999 | astronomical object | 8.684 | 8.943 | 245 | 2,444,109 | 1.283 | 79 | 0.516 | 0.117 | 92.622 | 7.378 | 0.003 | 1.218 | 1.222 | 0.023 | 0.004
92 | 9 | 14 | Q6881511 | enterprise | 0.036 | 0.052 | 1 | 1,937,486 | 1.017 | 234 | 1.528 | 0.436 | 2.584 | 97.416 | 0.083 | 0.812 | 0.538 | 0.0 | 0.071
26 | 10 | 29 | Q484170 | commune of France | 0.179 | 0.048 | 5 | 1,934,902 | 1.015 | 70 | 0.455 | 0.13 | 78.322 | 21.678 | 0.024 | 0.869 | 0.085 | 0.115 | 0.01
19 | 11 | 22 | Q13406463 | Wikimedia list article | 0.249 | 0.355 | 7 | 1,766,742 | 0.927 | 117 | 0.765 | 0.239 | 17.64 | 82.36 | 0.034 | 0.372 | 0.628 | 0.0 | 0.137
63 | 12 | 12 | Q5398426 | television series | 0.055 | 0.063 | 2 | 1,379,486 | 0.724 | 411 | 2.68 | 1.073 | 66.963 | 33.037 | 0.048 | 0.376 | 0.369 | 0.0 | 0.167
37 | 13 | 47 | Q7725634 | literary work | 0.087 | 0.203 | 2 | 1,377,546 | 0.723 | 42 | 0.273 | 0.11 | 6.75 | 93.25 | 0.39 | 0.181 | 0.243 | 0.0 | 0.009
16 | 14 | 4 | Q486972 | human settlement | 0.298 | 0.612 | 8 | 1,328,064 | 0.697 | 699 | 4.557 | 1.896 | 42.354 | 57.646 | 0.328 | 0.39 | 0.236 | 0.0 | 0.005
163 | 15 | 15 | Q891723 | public company | 0.015 | 0.013 | 0 | 1,175,813 | 0.617 | 219 | 1.426 | 0.67 | 10.916 | 89.084 | 0.042 | 0.415 | 0.185 | 0.001 | 0.092
90 | 16 | 6 | Q43229 | organization | 0.037 | 0.082 | 1 | 1,067,340 | 0.56 | 600 | 3.908 | 2.023 | 40.99 | 59.01 | 0.259 | 0.227 | 0.146 | 0.0 | 0.021
13 | 17 | 24 | Q3305213 | painting | 0.426 | 0.579 | 12 | 926,701 | 0.486 | 86 | 0.558 | 0.333 | 24.622 | 75.378 | 0.017 | 0.426 | 0.284 | 0.002 | 0.008
87 | 18 | 36 | Q47461344 | written work | 0.037 | 0.078 | 1 | 881,216 | 0.462 | 53 | 0.345 | 0.216 | 4.058 | 95.942 | 0.289 | 0.079 | 0.114 | 0.0 | 0.003
25 | 19 | 32 | Q532 | village | 0.199 | 0.294 | 6 | 872,310 | 0.458 | 61 | 0.399 | 0.253 | 39.133 | 60.867 | 0.003 | 0.417 | 0.198 | 0.0 | 0.015
4 | 20 | 28 | Q4167836 | Wikimedia category | 5.806 | 5.175 | 164 | 808,536 | 0.424 | 74 | 0.484 | 0.331 | 81.351 | 18.649 | 0.037 | 0.363 | 0.292 | 0.0 | 0.024
61 | 21 | 51 | Q7889 | video game | 0.055 | 0.048 | 2 | 753,351 | 0.395 | 37 | 0.244 | 0.179 | 62.267 | 37.733 | 0.006 | 0.181 | 0.314 | 0.002 | 0.01
20 | 22 | 41 | Q8502 | mountain | 0.248 | 0.559 | 7 | 749,283 | 0.393 | 47 | 0.306 | 0.225 | 67.13 | 32.87 | 0.002 | 0.369 | 0.351 | 0.0 | 0.001
28 | 23 | 33 | Q482994 | album | 0.16 | 0.288 | 5 | 704,746 | 0.37 | 59 | 0.388 | 0.304 | 27.474 | 72.526 | 0.012 | 0.15 | 0.189 | 0.0 | 0.098
89 | 24 | 17 | Q4164871 | position | 0.037 | 0.128 | 1 | 645,434 | 0.339 | 175 | 1.141 | 0.977 | 12.504 | 87.496 | 0.003 | 0.305 | 0.025 | 0.0 | 0.011
8 | 25 | 16 | Q7187 | gene | 0.911 | 1.273 | 26 | 604,364 | 0.317 | 208 | 1.354 | 1.238 | 31.577 | 68.423 | 0.084 | 0.1 | 0.022 | 0.015 | 0.127
11 | 26 | 26 | Q11173 | chemical compound | 0.684 | 1.302 | 19 | 588,469 | 0.309 | 76 | 0.496 | 0.466 | 68.901 | 31.099 | 0.135 | 0.11 | 0.092 | 0.002 | 0.014
55 | 27 | 54 | Q215380 | musical group | 0.062 | 0.087 | 2 | 585,266 | 0.307 | 37 | 0.241 | 0.227 | 39.016 | 60.984 | 0.01 | 0.205 | 0.16 | 0.0 | 0.011
31 | 28 | 39 | Q16970 | church building | 0.128 | 0.227 | 4 | 577,677 | 0.303 | 48 | 0.315 | 0.301 | 43.769 | 56.231 | 0.003 | 0.288 | 0.214 | 0.0 | 0.002
71 | 29 | 55 | Q732577 | publication | 0.047 | 0.076 | 1 | 569,536 | 0.299 | 37 | 0.238 | 0.231 | 4.203 | 95.797 | 0.283 | 0.015 | 0.296 | 0.0 | 0.0
22 | 30 | 43 | Q79007 | street | 0.23 | 0.626 | 6 | 535,623 | 0.281 | 44 | 0.289 | 0.298 | 47.589 | 52.411 | 0.028 | 0.246 | 0.218 | 0.001 | 0.001
23 | 31 | 34 | Q4022 | river | 0.216 | 0.425 | 6 | 520,347 | 0.273 | 56 | 0.365 | 0.388 | 53.592 | 46.408 | 0.002 | 0.254 | 0.192 | 0.0 | 0.002
242 | 32 | 8 | Q14204246 | Wikimedia project page | 0.008 | 0.033 | 0 | 498,708 | 0.262 | 548 | 3.572 | 3.957 | 8.77 | 91.23 | 0.026 | 0.19 | 0.038 | 0.0 | 0.064
36 | 33 | 63 | Q3947 | house | 0.096 | 0.216 | 3 | 465,249 | 0.244 | 33 | 0.212 | 0.252 | 58.051 | 41.949 | 0.0 | 0.238 | 0.223 | 0.0 | 0.002
32 | 34 | 31 | Q41176 | building | 0.124 | 0.29 | 3 | 463,636 | 0.243 | 65 | 0.423 | 0.504 | 37.511 | 62.489 | 0.042 | 0.189 | 0.168 | 0.001 | 0.002
307 | 35 | 62 | Q783794 | company | 0.005 | 0.012 | 0 | 459,638 | 0.241 | 33 | 0.213 | 0.256 | 44.132 | 55.868 | 0.081 | 0.146 | 0.1 | 0.0 | 0.006
29 | 36 | 48 | Q23397 | lake | 0.136 | 0.279 | 4 | 456,054 | 0.239 | 42 | 0.273 | 0.331 | 59.859 | 40.141 | 0.002 | 0.227 | 0.211 | 0.0 | 0.001
119 | 37 | 42 | Q3957 | town | 0.023 | 0.015 | 1 | 450,870 | 0.237 | 46 | 0.297 | 0.364 | 44.245 | 55.755 | 0.057 | 0.162 | 0.034 | 0.0 | 0.003
64 | 38 | 40 | Q811979 | architectural structure | 0.054 | 0.12 | 2 | 445,779 | 0.234 | 48 | 0.313 | 0.388 | 12.038 | 87.962 | 0.097 | 0.126 | 0.117 | 0.0 | 0.001
80 | 39 | 59 | Q34442 | road | 0.041 | 0.073 | 1 | 440,960 | 0.231 | 34 | 0.22 | 0.276 | 14.176 | 85.824 | 0.008 | 0.129 | 0.171 | 0.0 | 0.001
275 | 40 | 180 | Q21198342 | manga series | 0.007 | 0.015 | 0 | 437,382 | 0.23 | 11 | 0.074 | 0.093 | 28.665 | 71.335 | 0.01 | 0.052 | 0.2 | 0.0 | 0.003
72 | 41 | 23 | Q86850539 | Whitaker's Latin frequency type C | 0.047 | 0.011 | 1 | 436,103 | 0.229 | 95 | 0.622 | 0.788 | 10.35 | 89.65 | 0.0 | 0.0 | 0.0 | 0.0 | 0.228
138 | 42 | 139 | Q18340514 | events in a specific year or time period | 0.019 | 0.048 | 1 | 431,649 | 0.227 | 16 | 0.104 | 0.133 | 10.729 | 89.271 | 0.0 | 0.21 | 0.068 | 0.0 | 0.004
261 | 43 | 53 | Q2085381 | publisher | 0.007 | 0.015 | 0 | 420,459 | 0.221 | 37 | 0.243 | 0.319 | 52.906 | 47.094 | 0.001 | 0.21 | 0.068 | 0.0 | 0.004
44 | 44 | 38 | Q55488 | railway station | 0.074 | 0.104 | 2 | 410,774 | 0.216 | 49 | 0.319 | 0.43 | 25.81 | 74.19 | 0.001 | 0.172 | 0.163 | 0.0 | 0.002
108 | 45 | 27 | Q33506 | museum | 0.027 | 0.044 | 1 | 409,716 | 0.215 | 75 | 0.486 | 0.655 | 28.194 | 71.806 | 0.017 | 0.184 | 0.134 | 0.0 | 0.001
181 | 46 | 19 | Q34770 | language | 0.013 | 0.011 | 0 | 402,013 | 0.211 | 145 | 0.947 | 1.302 | 33.166 | 66.834 | 0.009 | 0.169 | 0.02 | 0.0 | 0.017
112 | 47 | 86 | Q15632617 | fictional human | 0.025 | 0.056 | 1 | 395,934 | 0.208 | 25 | 0.166 | 0.232 | 17.231 | 82.769 | 0.007 | 0.138 | 0.09 | 0.0 | 0.004
42 | 48 | 119 | Q22808320 | Wikimedia human name disambiguation page | 0.077 | 0.075 | 2 | 381,873 | 0.2 | 19 | 0.125 | 0.181 | 67.093 | 32.907 | 0.0 | 0.164 | 0.142 | 0.0 | 0.001
143 | 49 | 75 | Q11032 | newspaper | 0.017 | 0.043 | 0 | 380,153 | 0.199 | 28 | 0.181 | 0.263 | 55.697 | 44.303 | 0.002 | 0.169 | 0.143 | 0.0 | 0.019
38 | 50 | 117 | Q3331189 | version, edition, or translation | 0.087 | 0.191 | 2 | 374,597 | 0.197 | 19 | 0.126 | 0.186 | 10.191 | 89.809 | 0.117 | 0.037 | 0.134 | 0.0 | 0.038
Comparison of subgraph queries across time
Comparison of subgraph queries across time (Oct, Nov 2021)
Subgraph rank by size | Subgraph | Subgraph label | % of entities | % of triples | Oct query count | Oct % count of queries | Oct query time (hr) | Oct % time of queries | Nov query count | Nov % count of queries | Nov query time (hr) | Nov % time of queries
3 | Q5 | human | 9.986 | 7.324 | 68,659,369 | 31.058 | 6,314 | 39.3 | 60,868,572 | 31.941 | 5,248 | 34.195
5 | Q16521 | taxon | 3.427 | 2.871 | 56,437,140 | 25.529 | 495 | 3.1 | 27,172,995 | 14.259 | 480 | 3.131
34 | Q4830453 | business | 0.207 | 0.108 | 4,041,395 | 1.828 | 343 | 2.1 | 9,228,037 | 4.842 | 554 | 3.607
6 | Q101352 | family name | 0.509 | 1.546 | 5,564,173 | 2.517 | 640 | 4.0 | 5,990,617 | 3.144 | 659 | 4.292
15 | Q11424 | film | 0.281 | 0.364 | 4,757,084 | 2.152 | 1,613 | 10.0 | 5,067,305 | 2.659 | 1,541 | 10.042
1 | Q13442814 | scholarly article | 39.794 | 49.668 | 1,649,268 | 0.746 | 142 | 0.9 | 4,944,995 | 2.595 | 263 | 1.713
7 | Q4167410 | Wikimedia disambiguation page | 1.459 | 1.374 | 3,737,550 | 1.691 | 223 | 1.4 | 3,292,873 | 1.728 | 765 | 4.982
2 | Q6999 | astronomical object | 8.942 | 8.75 | 448,032 | 0.203 | 51 | 0.3 | 2,444,109 | 1.283 | 79 | 0.516
92 | Q6881511 | enterprise | 0.052 | 0.036 | 943,613 | 0.427 | 164 | 1.0 | 1,937,486 | 1.017 | 234 | 1.528
26 | Q484170 | commune of France | 0.043 | 0.18 | 866,766 | 0.392 | 46 | 0.3 | 1,934,902 | 1.015 | 70 | 0.455
20 | Q13406463 | Wikimedia list article | 0.352 | 0.252 | 1,283,160 | 0.58 | 73 | 0.5 | 1,766,742 | 0.927 | 117 | 0.765
63 | Q5398426 | television series | 0.062 | 0.055 | 1,206,285 | 0.546 | 366 | 2.3 | 1,379,486 | 0.724 | 411 | 2.68
42 | Q7725634 | literary work | 0.176 | 0.077 | 468,204 | 0.212 | 22 | 0.1 | 1,377,546 | 0.723 | 42 | 0.273
16 | Q486972 | human settlement | 0.602 | 0.302 | 721,789 | 0.327 | 73 | 0.5 | 1,328,064 | 0.697 | 699 | 4.557
165 | Q891723 | public company | 0.013 | 0.015 | 837,595 | 0.379 | 157 | 1.0 | 1,175,813 | 0.617 | 219 | 1.426
91 | Q43229 | organization | 0.08 | 0.037 | 806,840 | 0.365 | 123 | 0.8 | 1,067,340 | 0.56 | 600 | 3.908
12 | Q3305213 | painting | 0.578 | 0.432 | 834,752 | 0.378 | 79 | 0.5 | 926,701 | 0.486 | 86 | 0.558
86 | Q47461344 | written work | 0.078 | 0.038 | 774,947 | 0.351 | 67 | 0.4 | 881,216 | 0.462 | 53 | 0.345
25 | Q532 | village | 0.292 | 0.201 | 584,789 | 0.265 | 21 | 0.1 | 872,310 | 0.458 | 61 | 0.399
4 | Q4167836 | Wikimedia category | 5.165 | 5.85 | 1,383,343 | 0.626 | 96 | 0.6 | 808,536 | 0.424 | 74 | 0.484
62 | Q7889 | video game | 0.047 | 0.056 | 741,401 | 0.335 | 30 | 0.2 | 753,351 | 0.395 | 37 | 0.244
19 | Q8502 | mountain | 0.559 | 0.253 | 227,393 | 0.103 | 16 | 0.1 | 749,283 | 0.393 | 47 | 0.306
28 | Q482994 | album | 0.287 | 0.161 | 776,845 | 0.351 | 37 | 0.2 | 704,746 | 0.37 | 59 | 0.388
89 | Q4164871 | position | 0.128 | 0.037 | 788,077 | 0.356 | 332 | 2.1 | 645,434 | 0.339 | 175 | 1.141
8 | Q7187 | gene | 1.273 | 0.927 | 628,916 | 0.284 | 94 | 0.6 | 604,364 | 0.317 | 208 | 1.354
10 | Q11173 | chemical compound | 1.302 | 0.693 | 1,307,852 | 0.592 | 133 | 0.8 | 588,469 | 0.309 | 76 | 0.496
54 | Q215380 | musical group | 0.087 | 0.063 | 461,181 | 0.209 | 17 | 0.1 | 585,266 | 0.307 | 37 | 0.241
31 | Q16970 | church building | 0.226 | 0.129 | 396,936 | 0.18 | 25 | 0.2 | 577,677 | 0.303 | 48 | 0.315
70 | Q732577 | publication | 0.076 | 0.048 | 512,416 | 0.232 | 53 | 0.3 | 569,536 | 0.299 | 37 | 0.238
22 | Q79007 | street | 0.62 | 0.231 | 225,188 | 0.102 | 20 | 0.1 | 535,623 | 0.281 | 44 | 0.289
23 | Q4022 | river | 0.425 | 0.219 | 280,190 | 0.127 | 20 | 0.1 | 520,347 | 0.273 | 56 | 0.365
243 | Q14204246 | Wikimedia project page | 0.033 | 0.008 | 1,114,113 | 0.504 | 62 | 0.4 | 498,708 | 0.262 | 548 | 3.572
36 | Q3947 | house | 0.216 | 0.098 | 118,886 | 0.054 | 9 | 0.1 | 465,249 | 0.244 | 33 | 0.212
32 | Q41176 | building | 0.287 | 0.125 | 271,666 | 0.123 | 36 | 0.2 | 463,636 | 0.243 | 65 | 0.423
310 | Q783794 | company | 0.012 | 0.005 | 124,932 | 0.057 | 19 | 0.1 | 459,638 | 0.241 | 33 | 0.213
29 | Q23397 | lake | 0.278 | 0.138 | 130,027 | 0.059 | 14 | 0.1 | 456,054 | 0.239 | 42 | 0.273
121 | Q3957 | town | 0.015 | 0.023 | 294,685 | 0.133 | 24 | 0.1 | 450,870 | 0.237 | 46 | 0.297
64 | Q811979 | architectural structure | 0.119 | 0.055 | 282,739 | 0.128 | 28 | 0.2 | 445,779 | 0.234 | 48 | 0.313
80 | Q34442 | road | 0.073 | 0.041 | 215,771 | 0.098 | 14 | 0.1 | 440,960 | 0.231 | 34 | 0.22
280 | Q21198342 | manga series | 0.014 | 0.007 | 208,503 | 0.094 | 5 | 0.0 | 437,382 | 0.23 | 11 | 0.074
71 | Q86850539 | Whitaker's Latin frequency type C | 0.011 | 0.048 | 355,247 | 0.161 | 56 | 0.3 | 436,103 | 0.229 | 95 | 0.622
138 | Q18340514 | events in a specific year or time period | 0.048 | 0.019 | 463,683 | 0.21 | 17 | 0.1 | 431,649 | 0.227 | 16 | 0.104
264 | Q2085381 | publisher | 0.014 | 0.007 | 179,442 | 0.081 | 23 | 0.1 | 420,459 | 0.221 | 37 | 0.243
45 | Q55488 | railway station | 0.104 | 0.075 | 258,862 | 0.117 | 20 | 0.1 | 410,774 | 0.216 | 49 | 0.319
108 | Q33506 | museum | 0.044 | 0.028 | 252,308 | 0.114 | 54 | 0.3 | 409,716 | 0.215 | 75 | 0.486
177 | Q34770 | language | 0.011 | 0.013 | 1,713,196 | 0.775 | 73 | 0.5 | 402,013 | 0.211 | 145 | 0.947
113 | Q15632617 | fictional human | 0.056 | 0.026 | 306,319 | 0.139 | 18 | 0.1 | 395,934 | 0.208 | 25 | 0.166
41 | Q22808320 | Wikimedia human name disambiguation page | 0.075 | 0.078 | 433,986 | 0.196 | 17 | 0.1 | 381,873 | 0.2 | 19 | 0.125
144 | Q11032 | newspaper | 0.043 | 0.017 | 230,085 | 0.104 | 11 | 0.1 | 380,153 | 0.199 | 28 | 0.181
37 | Q3331189 | version, edition, or translation | 0.19 | 0.087 | 410,352 | 0.186 | 34 | 0.2 | 374,597 | 0.197 | 19 | 0.126
More on query time
The query time can be broken down into classes for better visualization. Below is a figure with the query class distribution (number of queries per query time class per subgraph) for the top 50 subgraphs w.r.t. query time consumed. Some of the takeaways are:
Most subgraphs have most of their queries in the range of 10ms to 100ms.
The second most common class is 100ms to 1s.
The collection and photograph subgraphs have the most queries (~150k) timed at 1s to 10s. Around 10 more subgraphs have a small number of queries (~10-20k) in this time range.
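The bucketing of query times into classes could look roughly like the sketch below. The three middle classes (10ms to 100ms, 100ms to 1s, 1s to 10s) are named on this page; the first (<10ms) and last (>=10s) classes are an assumption made here to complete the five classes, as are the function names and the choice to include each lower bound in its class.

```python
from bisect import bisect_right
from collections import Counter

# Class boundaries in milliseconds. The outer two classes are assumed.
BOUNDS_MS = [10, 100, 1_000, 10_000]
LABELS = ["<10ms", "10ms-100ms", "100ms-1s", "1s-10s", ">=10s"]


def time_class(duration_ms):
    # bisect_right counts how many boundaries are <= duration_ms,
    # which is the index of the class the duration falls into.
    return LABELS[bisect_right(BOUNDS_MS, duration_ms)]


def class_counts(durations_ms):
    # Number of queries per query time class.
    return Counter(time_class(d) for d in durations_ms)


# Made-up durations, for illustration only.
sample_counts = class_counts([5, 64, 80, 310, 2_500, 12_000])
```

Running `class_counts` once per subgraph gives the per-subgraph distributions that the figure plots.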
Distribution comparison
Following is the query time distribution of all queries, regardless of subgraph.
We then compare this distribution to the distributions listed above for the top subgraphs. To compare, we plot the differences of percentages in each subgraph with the percentage in all queries. That is,
For each subgraph:
For each query time class:
Percent of query count for this time class in this subgraph - Percent of query count in all queries
This shows us the difference in overall distribution with the distribution in the individual subgraphs.
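The comparison described above can be sketched as follows. The class labels and the counts are made up for illustration; only the subtraction of percentages follows the description on this page.

```python
def to_percent(counts):
    # Convert per-class query counts to percentages of the total.
    total = sum(counts.values())
    return {cls: 100 * n / total for cls, n in counts.items()}


def distribution_diff(subgraph_counts, overall_counts):
    # Percent of queries per time class in the subgraph minus the percent in all queries.
    sub = to_percent(subgraph_counts)
    overall = to_percent(overall_counts)
    return {cls: sub.get(cls, 0.0) - overall[cls] for cls in overall}


# Illustrative counts only, not measured data.
all_queries = {"10ms-100ms": 600, "100ms-1s": 300, "1s-10s": 100}
one_subgraph = {"10ms-100ms": 90, "100ms-1s": 10}

diff = distribution_diff(one_subgraph, all_queries)
```

A positive value means the class is over-represented in the subgraph relative to all queries, and a negative value means it is under-represented. The differences for one subgraph always sum to zero.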
User agent
The analysis of user-agents is an approximation because user-agent strings don't exactly represent distinct users. For example, lots of people use the same bot or script without changing the user-agent, and the same person or bot may use multiple user-agent strings. Still, the available data gives a useful estimate.
User agent count
Total number of unique user agents across all subgraphs: 981,180
First, the subgraphs with the most and the fewest distinct user-agents are listed. Every subgraph has at least 10 user-agents (the smallest count in the table below is 11), so even the least-queried of these large subgraphs are used by at least a handful of users.
The largest numbers of user-agents are found across a variety of subgraph types. Among them, gene, protein, biological process, and molecular function appear to be related. It is possible that the same queries touch several of these subgraphs. More on subgraph connectivity in #Subgraph Connectivity.
Subgraphs with most user-agents
Subgraph | Subgraph label | % Query | # User agents | User agents (fraction of all)
Q11424 | film | 2.152 | 251,420 | 0.256
Q8054 | protein | 0.158 | 234,659 | 0.239
Q7187 | gene | 0.284 | 187,029 | 0.191
Q2996394 | biological process | 0.072 | 124,415 | 0.127
Q14860489 | molecular function | 0.044 | 89,445 | 0.091
Q5 | human | 31.058 | 55,377 | 0.056
Q898273 | protein domain | 0.019 | 38,484 | 0.039
Q16521 | taxon | 25.529 | 25,193 | 0.026
Q86850539 | Whitaker's Latin frequency type C | 0.161 | 20,158 | 0.021
Q4167410 | Wikimedia disambiguation page | 1.691 | 13,818 | 0.014
Q14204246 | Wikimedia project page | 0.504 | 13,443 | 0.014
Q476028 | association football club | 0.145 | 12,086 | 0.012
Q235557 | file format | 0.045 | 7,701 | 0.008
Q1520033 | count noun | 0.05 | 7,662 | 0.008
Q417841 | protein family | 0.007 | 4,906 | 0.005
Q484170 | commune of France | 0.392 | 4,764 | 0.005
Q4830453 | business | 1.828 | 4,383 | 0.004
Q4164871 | position | 0.356 | 4,319 | 0.004
Q7278 | political party | 0.109 | 4,073 | 0.004
Q3918 | university | 0.104 | 3,565 | 0.004
Subgraphs with least user-agents
Subgraph | Subgraph label | % Query | # User agents | User agents (fraction of all)
Q106006703 | local regulations of the People's Republic of China | 0.0 | 11 | 0.0
Q67018630 | Government Boys' Primary School | 0.0 | 13 | 0.0
Q7604693 | Statutory Rules of Northern Ireland | 0.0 | 13 | 0.0
Q106474968 | ethnic group by settlement in Macedonia | 0.003 | 15 | 0.0
Q6453643 | decree law | 0.0 | 15 | 0.0
Q97695005 | committee group motion | 0.0 | 15 | 0.0
Q100532807 | Irish Statutory Instrument | 0.0 | 16 | 0.0
Q10429085 | report | 0.0 | 19 | 0.0
Q99045339 | written question | 0.0 | 20 | 0.0
Q1505023 | Interpellation | 0.0 | 20 | 0.0
Q96739634 | individual motion | 0.0 | 21 | 0.0
Q67035425 | ASTM standard | 0.0 | 21 | 0.0
Q61278455 | health sub-centre | 0.001 | 23 | 0.0
Q26267864 | Wikimedia KML file | 0.005 | 23 | 0.0
Q3508250 | Syndicat intercommunal | 0.02 | 24 | 0.0
Q107102664 | cell line from embryonic stem cells | 0.0 | 24 | 0.0
Q7604686 | UK Statutory Instrument | 0.0 | 27 | 0.0
Q6451276 | Congressional Research Service report | 0.001 | 28 | 0.0
Q61443650 | sub post office | 0.0 | 33 | 0.0
Q26894053 | basketball team season | 0.009 | 34 | 0.0
There are 50 subgraphs with more than 1000 user agents, and 300 subgraphs with fewer than 1000 user agents. Most subgraphs are therefore not queried by many distinct user agents. The distribution of user-agent counts below 1000 is shown in the figure below, which clearly shows the small number of user agents in most subgraphs.
User agent distribution in subgraphs
Next, the user agent vs query count distribution was analyzed for some of the top subgraphs. While the user agent count gives us an idea of how many users may be using a subgraph, it does not tell us whether all of them query the subgraph equally or whether very few user agents perform most of the queries.
~30 out of 341 subgraphs have a user agent that makes >=50% of all queries to that particular subgraph.
6 subgraphs have a user agent making around 80-90% of the queries.
So the trend of a dominating single source of queries is not widespread among subgraphs, but it is present in a few subgraphs nonetheless.
The figure below shows the query share (in percent) of the top 2 user agents for each of the 341 subgraphs. This shows whether the top user agents dominate a subgraph.
The figure below shows 100 subgraphs with their user agent query usage distribution in percent. Usage greater than 50% is marked in red. A bird's-eye view of the plots shows how some subgraphs have a dominating user agent, while most other subgraphs have at least 1 or 2 user agents that query the most. The rest of the user agents form the long tail of the distribution.
Top user agents in subgraphs
The top user agents in various subgraphs are listed below. More analysis on Q5 (human) and Q16521 (taxon) is done at the end of the page, as they are the most queried subgraphs.
Top user agents in various subgraphs
Subgraph | Subgraph label | User agent | Query count (in subgraph) | Query percent (within subgraph) | Query percent overall
Q16521 | taxon | mix-n-match | 50,622,670 | 89.697 | 22.899
Q5 | human | UA # 2 | 9,017,930 | 13.134 | 4.079
Q5 | human | mix-n-match | 8,548,335 | 12.45 | 3.867
Q5 | human | UA # 3 | 5,059,258 | 7.369 | 2.289
Q5 | human | UA # 4 | 4,020,496 | 5.856 | 1.819
Q5 | human | UA # 5 | 3,828,747 | 5.576 | 1.732
Q101352 | family name | UA # 5 | 3,828,747 | 68.811 | 1.732
Q5 | human | UA # 6 | 2,685,807 | 3.912 | 1.215
Q5 | human | UA # 7 | 2,434,486 | 3.546 | 1.101
Q4830453 | business | UA # 8 | 2,403,677 | 59.476 | 1.087
Q5 | human | UA # 9 | 2,020,598 | 2.943 | 0.914
Q16521 | taxon | Hub | 1,984,437 | 3.516 | 0.898
Q5 | human | UA # 11 | 1,877,700 | 2.735 | 0.849
Q5 | human | UA # 12 | 1,781,161 | 2.863 | 0.806
Q16521 | taxon | UA # 13 | 1,294,113 | 2.293 | 0.585
User agent vs Subgraph
So far we have explored the user-agent count and distribution per subgraph. It is also important to look at how each user agent's queries spread across subgraphs. In other words,
Do users have a very specific use case, so that their queries span only a few subgraphs, or are their queries spread across many subgraphs?
Are there some user agents that query the most in multiple subgraphs? This could be due to the nature of the use case or simply because some subgraphs overlap a lot.
We start by looking at how many user agents access how many subgraphs. From the table below, we see that most user agents (89% of them) query only one subgraph. Some user agents query a large number of subgraphs as well. A clearer picture is seen in the plot below.
Relationship between subgraphs and user agents
# of Subgraphs (X) | # of User agents querying X subgraphs | % of User agents querying X subgraphs
1 | 875,724 | 89.252
2 | 91,962 | 9.373
5 | 3,562 | 0.363
3 | 2,388 | 0.243
6 | 1,539 | 0.157
7 | 799 | 0.081
9 | 628 | 0.064
8 | 463 | 0.047
4 | 460 | 0.047
12 | 332 | 0.034
16 | 308 | 0.031
15 | 282 | 0.029
10 | 281 | 0.029
17 | 242 | 0.025
18 | 235 | 0.024
14 | 202 | 0.021
11 | 184 | 0.019
19 | 177 | 0.018
13 | 167 | 0.017
20 | 119 | 0.012
21 | 75 | 0.008
22 | 47 | 0.005
25 | 46 | 0.005
23 | 39 | 0.004
24 | 39 | 0.004
27 | 32 | 0.003
26 | 28 | 0.003
28 | 26 | 0.003
29 | 25 | 0.003
30 | 20 | 0.002
31 | 17 | 0.002
35 | 16 | 0.002
37 | 16 | 0.002
47 | 15 | 0.002
34 | 15 | 0.002
61 | 13 | 0.001
32 | 12 | 0.001
50 | 12 | 0.001
36 | 11 | 0.001
44 | 11 | 0.001
49 | 10 | 0.001
65 | 9 | 0.001
56 | 9 | 0.001
72 | 9 | 0.001
51 | 9 | 0.001
121 | 9 | 0.001
95 | 9 | 0.001
124 | 9 | 0.001
42 | 9 | 0.001
39 | 9 | 0.001
Next we isolate, from each subgraph, the user agents that query drastically more (>=10% difference) than other user agents in the same subgraph and perform at least 100k queries (0.05% of all queries) a month. A list of ~30 such user agents was found. The subgraph distributions of all these user agents were plotted to find the large buckets where they tend to query. The plot is shown below, followed by some explicit observations.
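One possible reading of this selection rule is sketched below. It assumes that ">=10% difference" means the top user agent's share of a subgraph's queries exceeds the runner-up's share by at least 10 percentage points. That interpretation, the function name, and the sample counts are assumptions for illustration, not the exact procedure used.

```python
def dominating_user_agents(counts, min_queries=100_000, min_gap_pct=10.0):
    # counts maps subgraph -> {user_agent: monthly query count}.
    # Returns (subgraph, user_agent, share_pct) for each dominating top user agent.
    found = []
    for subgraph, by_ua in counts.items():
        total = sum(by_ua.values())
        ranked = sorted(by_ua.items(), key=lambda kv: kv[1], reverse=True)
        top_ua, top_n = ranked[0]
        runner_up_n = ranked[1][1] if len(ranked) > 1 else 0
        # Gap between the top two user agents, in percentage points of the subgraph's queries.
        gap_pct = 100 * (top_n - runner_up_n) / total
        if top_n >= min_queries and gap_pct >= min_gap_pct:
            found.append((subgraph, top_ua, round(100 * top_n / total, 1)))
    return found


# Made-up counts, for illustration only.
example = {
    "Q16521": {"ua_a": 900_000, "ua_b": 50_000, "ua_c": 50_000},
    "Q5": {"ua_a": 300_000, "ua_d": 290_000, "ua_e": 10_000},
    "Q532": {"ua_x": 900, "ua_y": 100},
}
dominant = dominating_user_agents(example)
```

In this example only the first subgraph qualifies: the second has two user agents with nearly equal shares, and the third falls below the 100k query threshold.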
Percentages below are percent of all monthly queries.
mix n match (UA #17):
a lot of taxon queries (Q16521), 23%
a lot of human queries (Q5), 4%
UA #6:
1% in Business (Q4830453)
UA #14:
1% in human (Q5)
0.5% in film (Q11424)
UA #23:
1.73% in family name (Q101352)
1.73% in human (Q5)
both have identical counts, meaning these could be the same queries touching both the human and family name subgraphs
For reference:
100% is 221,067,674 queries
10% is 22,106,767 queries
1% is 2,210,676 queries
0.1% is 221,067 queries
0.05% is 110,533 queries
0.01% is 22,106 queries
Subgraph connectivity through queries
Subgraph connectivity was explored to some extent using only Wikidata in Wikidata_Subgraph_Analysis. This was based on what items or properties were common between subgraphs and how many direct connections were present between them. A visualization was created to show the strength of this connectivity between subgraphs here: wikidata_graph. This section aims to analyze the connectivity of subgraphs through the queries, i.e, how often are some subgraphs queried together.
Subgraph Queries: The total number of queries that touch at least one of the top 341 subgraphs is 72% of all queries.
First we look at how many subgraphs most queries access. The tables below show the query groups with the most and the fewest subgraphs accessed.
70% of all queries (97% of subgraph queries) touch only 1 or 2 subgraphs. 64% of all queries (90% of subgraph queries) touch only 1 subgraph.
Queries with most subgraphs accessed
# of Subgraphs | # of Queries
341 | 25
333 | 1
315 | 2
313 | 3
258 | 1
181 | 3
152 | 1
142 | 1
133 | 2
130 | 2
129 | 1
128 | 2
127 | 4
126 | 4
125 | 9
Queries with least subgraphs accessed
# of Subgraphs | # of Queries | % of Queries
1 | 142,507,736 | 64.463
2 | 12,464,811 | 5.638
3 | 1,767,253 | 0.799
4 | 586,173 | 0.265
5 | 364,445 | 0.165
6 | 221,485 | 0.1
7 | 188,012 | 0.085
8 | 112,922 | 0.051
9 | 102,524 | 0.046
10 | 68,871 | 0.031
11 | 50,341 | 0.023
12 | 38,102 | 0.017
13 | 34,075 | 0.015
14 | 24,003 | 0.011
15 | 17,935 | 0.008
It is hard to see which subgraphs occur together from the data above. So the subgraphs that occurred together in a query were broken into pairs, and the pairs of subgraphs that occur together the most were listed.
There are 57,970 subgraph pairs that occur together in queries. The total possible subgraph pair count is (340*341)/2 = 57,970. This shows that every subgraph is connected to every other subgraph through queries! Of course, the number of queries per pair varies widely.
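The pair counting described above can be sketched as follows. This is a minimal illustration, assuming each query has already been mapped to the list of subgraph Q-IDs it touches; the function names and the sample input are hypothetical, not the pipeline that produced these numbers.

```python
from collections import Counter
from itertools import combinations


def count_subgraph_pairs(query_subgraphs):
    """Count, for every unordered subgraph pair, the queries containing both.

    query_subgraphs: iterable of collections of subgraph Q-IDs, one per query.
    Pairs are stored in sorted order so that (A, B) and (B, A) are the same key.
    """
    pair_counts = Counter()
    for subgraphs in query_subgraphs:
        for pair in combinations(sorted(set(subgraphs)), 2):
            pair_counts[pair] += 1
    return pair_counts


def possible_pairs(n_subgraphs: int) -> int:
    """Number of distinct unordered pairs among n subgraphs: n*(n-1)/2."""
    return n_subgraphs * (n_subgraphs - 1) // 2


if __name__ == "__main__":
    sample = [
        ["Q5", "Q101352"],
        ["Q101352", "Q5"],
        ["Q5"],  # single-subgraph queries contribute no pair
        ["Q5", "Q11424", "Q13406463"],
    ]
    for pair, count in count_subgraph_pairs(sample).most_common():
        print(pair, count)
    print(possible_pairs(341))
```

A query touching k subgraphs contributes k*(k-1)/2 pairs, so the few queries that touch all 341 subgraphs are enough on their own to make every possible pair appear at least once.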
A list of some of the most queried subgraph pairs is shown below.
Top pairs of subgraphs that are queried together
Subgraph 1 | Subgraph 1 label | Subgraph 2 | Subgraph 2 label | # of queries | % of queries
Q101352 | family name | Q5 | human | 4,649,345 | 2.44
Q4830453 | business | Q6881511 | enterprise | 1,858,183 | 0.975
Q11424 | film | Q5 | human | 1,096,150 | 0.575
Q5 | human | Q7725634 | literary work | 1,067,191 | 0.56
Q4830453 | business | Q891723 | public company | 973,565 | 0.511
Q13406463 | Wikimedia list article | Q5 | human | 970,047 | 0.509
Q16521 | taxon | Q5 | human | 890,304 | 0.467
Q4167410 | Wikimedia disambiguation page | Q5 | human | 840,151 | 0.441
Q4830453 | business | Q5 | human | 680,786 | 0.357
Q3305213 | painting | Q4167410 | Wikimedia disambiguation page | 606,434 | 0.318
Q6881511 | enterprise | Q891723 | public company | 572,986 | 0.301
Q13442814 | scholarly article | Q5 | human | 527,538 | 0.277
Q47461344 | written work | Q732577 | publication | 514,321 | 0.27
Q4164871 | position | Q5 | human | 480,484 | 0.252
Q13442814 | scholarly article | Q4167410 | Wikimedia disambiguation page | 446,490 | 0.234
Q482994 | album | Q5 | human | 409,139 | 0.215
Q13406463 | Wikimedia list article | Q16521 | taxon | 401,466 | 0.211
Q13406463 | Wikimedia list article | Q4167410 | Wikimedia disambiguation page | 349,421 | 0.183
Q14204246 | Wikimedia project page | Q4167410 | Wikimedia disambiguation page | 341,845 | 0.179
Q43229 | organization | Q5 | human | 337,868 | 0.177
Q5398426 | television series | Q5 | human | 323,501 | 0.17
Q215380 | musical group | Q5 | human | 320,532 | 0.168
Q47461344 | written work | Q5 | human | 313,149 | 0.164
Q5 | human | Q6881511 | enterprise | 285,110 | 0.15
Q3331189 | version, edition, or translation | Q5 | human | 283,741 | 0.149
Q5 | human | Q86850539 | Whitaker's Latin frequency type C | 280,866 | 0.147
Q11424 | film | Q13406463 | Wikimedia list article | 272,316 | 0.143
Q13406463 | Wikimedia list article | Q18340514 | events in a specific year or time period | 270,710 | 0.142
Q16521 | taxon | Q4167410 | Wikimedia disambiguation page | 266,507 | 0.14
Q4167410 | Wikimedia disambiguation page | Q86850539 | Whitaker's Latin frequency type C | 249,340 | 0.131
The distribution of the number of times each subgraph pair in Wikidata occurs in queries is shown below. Note that the pair (A,B) is the same as the pair (B,A), so there is no duplication in the plots. Since the distribution is extremely skewed, three plots with different limits on the number of occurrences are shown. Only a small number of pairs occur together a lot (these are listed in the table above), whereas a huge number of pairs occur together only a few times.
Below is a heat-map of the number of queries, where both x and y axis represent subgraph indices (names of subgraphs not shown due to space). The subgraphs are sorted by most queried subgraphs.
The diagonal shows queries that use only 1 subgraph, represented as Q5-Q5 or Q42-Q42, for example. Other cells are represented as Q5-Q42 or Q42-Q5.
The plot is symmetrical.
The many vertical and horizontal lines indicate that lots of subgraphs pair with many other subgraphs.
To see whether there is a correlation between accessing more subgraphs and query time, various subsets of subgraphs were taken and their query time distributions were observed. Then the number (and percentage) of queries that access various numbers of subgraphs was plotted for each query time group. A simple scatter plot of time against number of subgraphs was not possible due to the large number of queries, but the following plots give a good idea of the correlation. The Pearson correlation between query time and number of subgraphs accessed is 0.016, which is too weak to indicate a meaningful relationship. All query time groups are dominated by queries that access 1 or 2 subgraphs. Queries accessing more subgraphs do appear comparatively more often in the "More than 10s" group.
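Because the number of queries is too large for a scatter plot, a correlation coefficient like the one quoted above is conveniently computed in a single pass from running sums. The following is a minimal sketch of that technique, not the code used for this analysis; the function name and the sample pairs are illustrative.

```python
import math


def pearson(pairs):
    """Pearson correlation of (x, y) pairs, computed from running sums.

    Only six accumulators are kept, so the pairs can be streamed from a log
    without holding them in memory. Returns NaN when either variable has no
    variance. For very large floating-point inputs a numerically stabler
    (e.g. Welford-style) update may be preferable.
    """
    n = sx = sy = sxx = syy = sxy = 0
    for x, y in pairs:
        n += 1
        sx += x
        sy += y
        sxx += x * x
        syy += y * y
        sxy += x * y
    cov = n * sxy - sx * sy
    var_x = n * sxx - sx * sx
    var_y = n * syy - sy * sy
    if var_x <= 0 or var_y <= 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical (query time in ms, number of subgraphs accessed) pairs.
    sample = [(12, 1), (85, 1), (40, 2), (950, 1), (15000, 3), (60, 1)]
    print(round(pearson(sample), 3))
```

A value near 0, such as the 0.016 reported here, only says there is no linear relationship; it does not rule out effects confined to a small group of queries, which is why the per-time-class plots are still useful.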
The following analysis was done with data from November 2021. Thus there are slight differences in numbers from the above analysis, which was done with October 2021 data.
Paired subgraph query time analysis
We observed that some queries do indeed take more time when more subgraphs are involved. But this could also occur because of the particular subgraphs being accessed. For example, even simple queries in the scholarly article subgraph may take a long time or time out simply due to the large size of the graph they have to comb through. To make this analysis complete, we look at the query times of queries that access only subgraph X and of those that access X together with other subgraphs. The influence of the other subgraphs persists, but we can now pick out anything clearly similar or different. If the query time class distributions in both plots look similar, we can attribute the query times to subgraph X. If there are more long-running queries when X is accessed with other subgraphs than when only X is accessed, we can assume that X is not the sole cause of these long-running queries. They could be due to the other subgraphs, or simply due to the nature of the query (too complex, lots of string manipulation, regex, etc.).
The plots below show the comparison of query time class distributions for queries that use only subgraph X versus queries that use X and other subgraphs, where X ranges over the top 30 large subgraphs. It does look like some subgraphs have longer-running queries when other subgraphs are involved, such as scholarly article, lake, hill, clinical trial, and river.
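The "only X" versus "X with other subgraphs" comparison can be sketched as below. The class boundaries are an assumption inferred from the groups named on this page (10ms to 100ms, 100ms to 1s, more than 10s); the input format and function names are hypothetical.

```python
from collections import Counter

# Assumed upper bounds (in milliseconds) of the first four query time classes;
# anything at or above the last bound falls in the ">10s" class.
TIME_CLASS_BOUNDS = [
    (10, "<10ms"),
    (100, "10ms-100ms"),
    (1_000, "100ms-1s"),
    (10_000, "1s-10s"),
]


def time_class(query_time_ms: float) -> str:
    """Map a query time in milliseconds to one of the 5 time classes."""
    for upper, label in TIME_CLASS_BOUNDS:
        if query_time_ms < upper:
            return label
    return ">10s"


def _to_percent(counter: Counter) -> dict:
    total = sum(counter.values())
    return {label: 100 * n / total for label, n in counter.items()} if total else {}


def compare_time_classes(queries, subgraph: str):
    """Time class distribution (in %) for queries using only `subgraph`
    versus queries using `subgraph` together with other subgraphs.

    queries: iterable of (set of subgraph Q-IDs, query time in ms).
    """
    only, mixed = Counter(), Counter()
    for subgraphs, query_time_ms in queries:
        if subgraph not in subgraphs:
            continue
        target = only if len(subgraphs) == 1 else mixed
        target[time_class(query_time_ms)] += 1
    return _to_percent(only), _to_percent(mixed)


if __name__ == "__main__":
    sample = [({"Q5"}, 50), ({"Q5"}, 70), ({"Q5", "Q11424"}, 15_000), ({"Q5", "Q11424"}, 200)]
    only_q5, q5_with_others = compare_time_classes(sample, "Q5")
    print(only_q5)
    print(q5_with_others)
```

Comparing percentages rather than raw counts keeps the two distributions comparable even though far more queries touch a single subgraph than several.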
Human subgraph (Q5) query analysis
The following analysis was done with query data of November, 2021.
The queries estimated to be related to the human subgraph accounted for 31.94% of all queries in Wikidata. 25.78% of queries used only the human subgraph and the remaining 6.16% used a mix of human and various other subgraphs. As described in #What are subgraph related queries, subgraphs are related to queries through Properties, Subject or Object URIs, Subgraph instance items, etc. Here is a breakdown for the human subgraph taken from #Query count and time. A query can be related to the human subgraph for more than one of the following reasons.
Number of queries: 60,868,572 (31.94%)
Percent of queries matching subgraph Qid, i.e, has Q5: 2.54%
Percent of queries matching instance items: 18%
Percent of queries matching subject/object URIs: 12%
Percent of queries matching properties: 19.45%
Percent of queries matching literal strings: 1.43%
Some of these breakdowns have large percentages. It is worth looking at which items/properties/URIs are queried the most. Looking at the distribution of these items' usage in queries also shows how narrow or wide the search space is.
Here is a detailed breakdown of what kind of match caused a query to be part of the human subgraph:
Human subgraph query breakdown
item | predicate | URI | human Q-id | literal | # query | % all query | % human query
0 | 1 | 0 | 0 | 0 | 17,785,347 | 9.333 | 29.219
1 | 1 | 1 | 0 | 0 | 12,215,379 | 6.41 | 20.068
1 | 0 | 0 | 0 | 0 | 10,705,360 | 5.618 | 17.588
1 | 0 | 1 | 0 | 0 | 7,253,287 | 3.806 | 11.916
1 | 1 | 0 | 0 | 0 | 3,137,130 | 1.646 | 5.154
0 | 0 | 1 | 0 | 0 | 2,512,142 | 1.318 | 4.127
0 | 0 | 0 | 0 | 1 | 1,775,347 | 0.932 | 2.917
0 | 1 | 0 | 1 | 0 | 1,694,236 | 0.889 | 2.783
0 | 0 | 0 | 1 | 0 | 930,137 | 0.488 | 1.528
1 | 1 | 0 | 1 | 0 | 598,261 | 0.314 | 0.983
1 | 1 | 1 | 1 | 0 | 508,706 | 0.267 | 0.836
0 | 1 | 0 | 0 | 1 | 407,610 | 0.214 | 0.67
0 | 0 | 0 | 1 | 1 | 350,982 | 0.184 | 0.577
0 | 1 | 1 | 1 | 0 | 311,340 | 0.163 | 0.511
0 | 1 | 1 | 0 | 0 | 226,959 | 0.119 | 0.373
1 | 0 | 0 | 1 | 0 | 178,650 | 0.094 | 0.294
0 | 1 | 1 | 1 | 1 | 135,684 | 0.071 | 0.223
1 | 0 | 1 | 1 | 0 | 76,736 | 0.04 | 0.126
0 | 1 | 0 | 1 | 1 | 56,971 | 0.03 | 0.094
1 | 0 | 1 | 0 | 1 | 3,451 | 0.002 | 0.006
1 | 0 | 0 | 0 | 1 | 2,844 | 0.001 | 0.005
0 | 0 | 1 | 0 | 1 | 702 | 0.0 | 0.001
1 | 1 | 1 | 1 | 1 | 437 | 0.0 | 0.001
1 | 1 | 0 | 1 | 1 | 393 | 0.0 | 0.001
0 | 0 | 1 | 1 | 0 | 304 | 0.0 | 0.0
1 | 1 | 0 | 0 | 1 | 93 | 0.0 | 0.0
1 | 0 | 0 | 1 | 1 | 59 | 0.0 | 0.0
1 | 1 | 1 | 0 | 1 | 17 | 0.0 | 0.0
1 | 0 | 1 | 1 | 1 | 5 | 0.0 | 0.0
0 | 1 | 1 | 0 | 1 | 3 | 0.0 | 0.0
Total | | | | | 60,868,572 | 31.94 | 100
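A breakdown like the one above can be produced by turning each query's set of match reasons into a 0/1 pattern and counting identical patterns. The sketch below assumes the matching step has already labelled every subgraph query with the kinds of match it had; the label names and the function are illustrative only.

```python
from collections import Counter

# Column order of the breakdown table: item, predicate, URI, Q-id, literal.
MATCH_KINDS = ("item", "predicate", "uri", "qid", "literal")


def match_breakdown(query_matches, total_queries: int):
    """Group subgraph queries by their combination of match kinds.

    query_matches: iterable of sets, one per subgraph query, each holding the
        match kinds (from MATCH_KINDS) that tied the query to the subgraph.
    total_queries: number of all queries in the month, for the overall share.

    Returns rows of (pattern, count, % of all queries, % of subgraph queries),
    most frequent pattern first.
    """
    patterns = Counter(
        tuple(int(kind in matched) for kind in MATCH_KINDS)
        for matched in query_matches
    )
    subgraph_total = sum(patterns.values())
    return [
        (
            pattern,
            count,
            round(100 * count / total_queries, 3),
            round(100 * count / subgraph_total, 3),
        )
        for pattern, count in patterns.most_common()
    ]


if __name__ == "__main__":
    sample = [{"predicate"}, {"predicate"}, {"item", "predicate", "uri"}, {"item"}]
    for row in match_breakdown(sample, total_queries=10):
        print(row)
```

Because each query lands in exactly one pattern, the rows are mutually exclusive and their counts add up to the subgraph total, unlike the per-kind percentages listed earlier, which overlap.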
Instance items matched
Total items used: 7,969,182
Total queries that use these items: 34,680,808 (18% of all queries)
The distribution shows there are some high usage (~10k-20k queries) items, a small number of medium usage (~5k queries) items, and the rest form a long tail of small usage (<1k queries) items in the human subgraph.
Top items that cause a query to be related to Human subgraph (Q5)
Instance item | Instance item label | # of queries
Q22686 | Donald Trump | 19,759
Q1747297 | Robert Oliveri | 19,247
Q509260 | John Zimmerman | 19,193
Q6499255 | Laura Nader | 19,135
Q209394 | Michael Wood | 19,101
Q937 | Albert Einstein | 19,098
Q7340648 | Rob Whitehurst | 19,026
Q52354375 | Irene Aparicio | 18,970
Q6232209 | John F. Cassidy | 18,964
Q22986632 | Lori Lynn Ross | 18,954
Q3976229 | Stuart Lancaster | 18,953
Q106466114 | Gary Michael Ritchie | 18,947
Q86599148 | James Spicer | 18,926
Q87653156 | David A. Cook | 18,919
Q16015822 | Jerry Fleck | 18,917
Q7179427 | Petur Hliddal | 18,914
Q19878977 | Jackie Carson | 18,902
Q99859767 | Kathy McCarty | 18,898
Q90307934 | Ann Harris | 18,893
Q1070508 | Cheryl Carasik | 18,834
Q9682 | Elizabeth II | 18,816
Q6279 | Joe Biden | 18,277
Q64840837 | Dylan Arnold | 18,161
Q76 | Barack Obama | 18,035
Q107626126 | Mauricio Lara | 18,010
Properties matched
Total properties used: 1,091 (Recall these are properties that occur 99% of the times in the human subgraph)
Total queries that use these properties: 37,078,566 (19.45% of all queries)
The distribution shows there are 3 properties with ~20-30M queries, 7 properties with ~1-5M queries, and the remaining 1,000+ properties each match ~100K queries or fewer. In short, the distribution is extremely skewed toward only ~10 properties that are highly related to the human subgraph.
Top properties that cause a query to be related to Human subgraph (Q5)
Property | Property label | # of queries
P570 | date of death | 30,151,024
P569 | date of birth | 30,084,200
P27 | country of citizenship | 24,186,000
P106 | occupation | 5,259,920
P734 | family name | 4,871,326
P735 | given name | 4,616,631
P19 | place of birth | 2,379,702
P2949 | WikiTree person ID | 1,707,373
P20 | place of death | 1,222,037
P4985 | TMDb person ID | 916,399
P39 | position held | 750,380
P3602 | candidacy in election | 599,067
P69 | educated at | 561,380
P26 | spouse | 471,589
P108 | employer | 384,111
P2562 | married name | 279,197
P937 | work location | 258,707
P1066 | student of | 158,339
P184 | doctoral advisor | 152,318
P1960 | Google Scholar author ID | 151,507
P185 | doctoral student | 150,982
P54 | member of sports team | 150,573
P1153 | Scopus author ID | 150,545
P119 | place of burial | 144,027
P3829 | Publons author ID | 138,839
Subject/Object URI matched
Total URIs used: 7,926,297 (Recall these are URIs that occur 99% of the times in the human subgraph)
Total queries that use these URIs: 23,245,152 (12.2% of all queries)
The top URIs/items show the obvious and most common ways the human subgraph is queried: queries about specific people, about groups of people, and about their Wikipedia pages. More about the types of queries below.
The distribution is a smooth logarithmic graph with only one item present in 165k queries, and the rest go down from 40k in a logarithmic pattern.
Top URIs that cause a query to be related to Human subgraph (Q5)
URI | URI label | # of queries
Q3391743 | visual artist | 165,540
Q1925963 | graphic artist | 38,897
Q28389 | screenwriter | 33,718
en.wikipedia.org/wiki/Lee_Child | - | 33,179
en.wikipedia.org/wiki/Emily_Wilson_(journalist) | - | 30,837
en.wikipedia.org/wiki/M.I.A._(rapper) | - | 29,388
Q10800557 | film actor | 29,318
en.wikipedia.org/wiki/Shannon_Lee | - | 29,216
en.wikipedia.org/wiki/Eugene_Gordon_Lee | - | 29,205
en.wikipedia.org/wiki/Lee_Childs | - | 29,203
en.wikipedia.org/wiki/Emily_Wilson_(classicist) | - | 26,864
en.wikipedia.org/wiki/Emily_Wilson_(actress) | - | 26,862
en.wikipedia.org/wiki/Adhir_Kalyan | - | 26,862
en.wikipedia.org/wiki/Emily_Wilson_(footballer) | - | 26,861
en.wikipedia.org/wiki/Emily_Wilson_Walker | - | 26,861
Q10798782 | television actor | 24,679
Q185351 | jurist | 22,130
Q1650915 | researcher | 21,206
Q2374149 | botanist | 20,385
Q250867 | Catholic priest | 20,314
Q10873124 | chess player | 19,832
Q12299841 | cricketer | 19,414
Q14373094 | rugby league player | 19,396
Q509260 | John Zimmerman | 19,193
Q6499255 | Laura Nader | 19,135
Query time
The total query time of human subgraph is 34% of total query time and total query count is ~32% of all queries.
Average time per query is 0.3 seconds (300 ms). Most queries in this subgraph are small and simple.
The query time distribution is shown in the chart below, both in absolute counts and in percent of queries in human subgraph.
User agent
The top user agents that query the human subgraph are listed below. This helps us view the distribution of usage: whether a few user agents dominate the usage or whether usage is well distributed across user agents. The top 10 user agents in terms of query count and in terms of query time are shown in the table below.
Top user agents in human subgraph
User agent | Query count | % query in human subgraph | % query overall | Query time (hr) | % query time in human subgraph | % query time overall
mix-n-match | 6,960,988 | 11.436 | 3.653 | 79 | 1.51 | 0.516
searx1 | 6,615,319 | 10.868 | 3.471 | 778 | 14.832 | 5.072
UA#3 | 3,491,821 | 5.737 | 1.832 | 75 | 1.426 | 0.487
UA#4 | 3,073,725 | 5.05 | 1.613 | 175 | 3.327 | 1.138
UA#5 | 2,933,240 | 4.819 | 1.539 | 80 | 1.516 | 0.518
UA#6 | 2,488,807 | 4.089 | 1.306 | 19 | 0.364 | 0.125
UA#7 | 2,182,220 | 3.585 | 1.145 | 44 | 0.841 | 0.288
WikidataQueryServiceR | 2,044,045 | 3.358 | 1.073 | 36 | 0.68 | 0.232
UA#9 | 1,970,264 | 3.237 | 1.034 | 27 | 0.524 | 0.179
searx2 | 1,909,144 | 3.137 | 1.002 | 200 | 3.808 | 1.302
UA#11 | 75,523 | 0.124 | 0.04 | 434 | 8.271 | 2.828
UA#12 | 55,357 | 0.091 | 0.029 | 319 | 6.083 | 2.08
searx3 | 1,428,789 | 2.347 | 0.75 | 151 | 2.871 | 0.982
OB-bot | 287,534 | 0.472 | 0.151 | 144 | 2.736 | 0.935
UA#15 | 50,915 | 0.084 | 0.027 | 134 | 2.553 | 0.873
UA#16 | 31,298 | 0.051 | 0.016 | 112 | 2.132 | 0.729
searx4 | 771,932 | 1.268 | 0.405 | 92 | 1.761 | 0.602
The query time breakdown was plotted for the top 20 user agents (in terms of time). Most queries have query time of 10ms to 1s, as observed earlier. Some user agents have most queries in the range 10ms to 100ms and some others have most queries in the range 100ms to 1s.
Query types
Query types are grouped by the operations a query uses and also the order of operations used. This groups similar queries together despite different information sought and also separates groups of simple or complicated queries. The human subgraph has ~11,500 different types of queries. Notice that some query groups can be very similar in what they ask for, while most groups differ a lot. The top query groups are listed below. The top 10 types of queries account for 60% of the queries of human subgraph. The rest form a really long tail of small query counts.
Looking at the top 20 query groups (that make up 70% of all human subgraph queries), the following query types were found:
1. Query lists lots of predicates with times/date precision of some items
2. Mix n Match: Birth and Death of certain people, with filters
3. Like 1, slightly different
4. Name and family information of people in specific languages
5. All properties of some humans (these queries are generic but here used for humans)
6. Uses P227 (or some international ID) and retrieves the Wikipedia page for it
7. Same as 6 but written differently
8. Just wants labels in specific languages
9. Same as 6 but written differently
10. All contact info of people (facebook, instagram, youtube, twitter, etc)
11. Labels and Wikipedia article of people
12. Only label and description of people
13. Search for films by director filter
14. CEO or high officials of companies searched by name strings
15. Timeline of people of a particular occupation and particular gender
16. Occupation, name, birth, death of people
17. Education institution and its reference of people
18. Label and Wikipedia page of people in specific languages
19. List all people, or all people of certain occupation, or entity related to a given Wikipedia page (generic query)
20. Notable works, labels, and IDs of works like isbns
UA vs query types
Getting the number of query types per user agent informs us of the variety of queries a user agent makes to WDQS. This also breaks down the human subgraph queries into finer groups. The following plot shows the number of query types for each user agent in the human subgraph.
This shows us that most user agents make only 1 type of query. Only 8 user agents make queries of >500 types, and ~50 user agents make queries of >100 types. Looking into the query counts in each of these user agent and query type groups, we find that most have few queries (<10,000); only ~10 groups have >10,000 queries, and all of these are small, simple queries. The figure below shows the number of queries per query type for the top 8 user agents. As we can see, their distributions look alike although their query counts and query types differ.
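Counting query types per user agent is a small grouping exercise. The sketch below assumes each query has already been reduced to a (user agent, query type) record, where the query type is whatever signature the grouping step assigns; the names and sample records are hypothetical.

```python
from collections import defaultdict


def query_types_per_user_agent(records):
    """Number of distinct query types issued by each user agent.

    records: iterable of (user_agent, query_type) tuples, one per query.
    """
    types_by_agent = defaultdict(set)
    for user_agent, query_type in records:
        types_by_agent[user_agent].add(query_type)
    return {agent: len(types) for agent, types in types_by_agent.items()}


def share_with_single_type(type_counts) -> float:
    """Percentage of user agents that issue exactly one query type."""
    if not type_counts:
        return 0.0
    single = sum(1 for n in type_counts.values() if n == 1)
    return 100 * single / len(type_counts)


if __name__ == "__main__":
    sample = [
        ("agent-a", "type-1"),
        ("agent-a", "type-1"),
        ("agent-b", "type-1"),
        ("agent-b", "type-2"),
        ("agent-c", "type-3"),
    ]
    counts = query_types_per_user_agent(sample)
    print(counts)
    print(share_with_single_type(counts))
```

Using a set per user agent means repeated queries of the same type are counted once, which is what separates "variety of queries" from plain query volume.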
Query type vs time class
While there are close to 11,500 query types, 20 of these types make up 70% of all queries of the human subgraph (22% of all queries). Not all of them are equally time consuming: some are simple queries, while others are long and complex. The following plot shows these 20 query types with query time classes. The values above the bars show both the percentage within the human subgraph and the overall query percentage. The subplots are titled with the percentage of queries in that query type.
Services
The queries use ~50 unique services. The top 10 services are the most used; the rest are used in fewer than 50 queries, mostly in fewer than 10 queries. 20 of these services are used in only 1 query.
Some query type analysis done in section query types gives us a good idea of what kind of queries human subgraph receives. Looking at the triples themselves also helps us peek into what most of the queries look like, what the most common subjects, objects, and properties are. The table below lists these along with the top Wikidata items and properties used overall. From the numbers it seems the top items are probably part of the same queries.
Paths are more complex predicates that chain properties with logic. Complex paths can increase the scope of a query and also increase its runtime. The table below lists the most used paths in human subgraph queries. While most paths are not very complex or long, there is a lot of variety in the ways paths are formed to perform queries. Ordinary properties are not considered paths. The following list contains not only the paths, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance, (p:P31/ps:P31)/(wdt:P279)* is recorded as:
(p:P31/ps:P31)/(wdt:P279)*
(p:P31/ps:P31)
p:P31
ps:P31
(wdt:P279)*
wdt:P279
Unit forms, such as wdt:P279, were removed from the path list since they are parts of other paths and not paths themselves. Other entries that were obviously part of a longer path, and not paths themselves, were also removed from the list to better visualize the distinct paths used in the queries.
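To make the component breakdown concrete, here is a small string-level sketch that reproduces the listing shown for (p:P31/ps:P31)/(wdt:P279)*. It is only an approximation written for illustration and is not how Jena ARQ parses paths: it assumes prefixed names (no full IRIs in angle brackets), the operators "/", "|", "*", "+", "?", and balanced parentheses.

```python
def _split_top_level(expr: str, sep: str) -> list[str]:
    """Split expr on sep, ignoring separators nested inside parentheses."""
    parts, depth, start = [], 0, 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == sep and depth == 0:
            parts.append(expr[start:i])
            start = i + 1
    parts.append(expr[start:])
    return parts


def _is_wrapped(expr: str) -> bool:
    """True if a single pair of parentheses encloses the whole expression."""
    if not (expr.startswith("(") and expr.endswith(")")):
        return False
    depth = 0
    for i, ch in enumerate(expr):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                return i == len(expr) - 1
    return False


def _direct_parts(path: str) -> list[str]:
    """Immediate components of a path; empty for a plain property."""
    # Alternatives bind more loosely than sequences, so try "|" first.
    for sep in "|/":
        parts = _split_top_level(path, sep)
        if len(parts) > 1:
            return parts
    # Single element: peel off a trailing modifier and one layer of parentheses.
    inner = path
    if inner and inner[-1] in "*+?":
        inner = inner[:-1]
    if _is_wrapped(inner):
        inner = inner[1:-1]
    if inner == path:
        return []
    for sep in "|/":
        parts = _split_top_level(inner, sep)
        if len(parts) > 1:
            return parts
    return [inner]


def decompose(path: str) -> list[str]:
    """The path followed by all of its nested components, in pre-order."""
    result = [path]
    for part in _direct_parts(path):
        result.extend(decompose(part))
    return result


if __name__ == "__main__":
    for component in decompose("(p:P31/ps:P31)/(wdt:P279)*"):
        print(component)
```

Filtering the unit forms described above then amounts to dropping every component for which `_direct_parts` returns an empty list.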
Taxon subgraph (Q16521) query analysis
The following analysis was done with query data of November, 2021.
The queries estimated to be related to the taxon subgraph accounted for 14.26% of all queries in Wikidata. 13.57% of queries used only the taxon subgraph and the remaining 0.69% used a mix of taxon and various other subgraphs. As described in #What are subgraph related queries, subgraphs are related to queries through Properties, Subject or Object URIs, Subgraph instance items, etc. Here is a breakdown for the taxon subgraph taken from #Query count and time. A query can be related to the taxon subgraph for more than one of the following reasons.
Number of queries: 27,172,995 (14.26%)
Percent of queries matching subgraph Q-ID, i.e., has Q16521: 12.19%
Percent of queries matching instance items: 0.75%
Percent of queries matching subject/object URIs: 12.86%
Percent of queries matching properties: 0.87%
Percent of queries matching literal strings: 0.43%
The 12.86% of queries matching subject/object URIs include the Q-ID matches (12.19%) and the instance item matches (0.75%). This makes the Q-ID match almost the only reason for queries to be in the taxon subgraph. Therefore we look at the top URIs that cause these matches. Looking at the distribution of these items' usage in queries also shows how narrow or wide the search space is (in this case, quite narrow, as almost all queries match the Q-ID itself).
Here is a detailed breakdown of what kind of match caused a query to be part of the taxon subgraph:
Taxon subgraph query breakdown
item | predicate | URI | taxon Q-id | literal | # query | % all query | % taxon query
0 | 0 | 1 | 1 | 0 | 22,955,853 | 12.046 | 84.48
0 | 1 | 0 | 0 | 0 | 1,415,482 | 0.743 | 5.209
1 | 0 | 0 | 0 | 0 | 638,533 | 0.335 | 2.35
0 | 0 | 1 | 0 | 0 | 624,147 | 0.328 | 2.297
1 | 0 | 1 | 0 | 0 | 501,232 | 0.263 | 1.845
0 | 0 | 0 | 0 | 1 | 443,593 | 0.233 | 1.632
0 | 0 | 1 | 1 | 1 | 233,109 | 0.122 | 0.858
1 | 1 | 1 | 0 | 0 | 132,462 | 0.07 | 0.487
1 | 0 | 0 | 0 | 1 | 66,408 | 0.035 | 0.244
0 | 1 | 0 | 0 | 1 | 60,111 | 0.032 | 0.221
1 | 1 | 0 | 0 | 0 | 38,147 | 0.02 | 0.14
1 | 0 | 1 | 1 | 0 | 30,652 | 0.016 | 0.113
0 | 0 | 1 | 0 | 1 | 13,104 | 0.007 | 0.048
1 | 0 | 1 | 0 | 1 | 9,026 | 0.005 | 0.033
0 | 1 | 1 | 1 | 0 | 5,847 | 0.003 | 0.022
1 | 1 | 1 | 1 | 0 | 5,248 | 0.003 | 0.019
0 | 1 | 1 | 1 | 1 | 24 | 0.0 | 0.0
0 | 0 | 0 | 1 | 0 | 14 | 0.0 | 0.0
0 | 1 | 1 | 0 | 0 | 3 | 0.0 | 0.0
Total | | | | | 27,172,995 | 14.26 | 100
Instance items matched
Total items used: 588,668
Total queries that use these items: 1,421,708 (0.75% of all queries)
The distribution shows there are only 3 high usage (>100k queries) items, and the rest form a long tail of small usage (<1k queries) items in the taxon subgraph.
Note that these are for the queries from the month of November 2021. These data change from one month to another.
Top items that cause a query to be related to Taxon subgraph (Q16521)
Instance item | Instance item label | # of queries
Q15978631 | Homo sapiens | 148,455
Q83310 | house mouse | 111,473
Q184224 | brown rat | 111,397
Q25400 | Asteraceae | 67,794
Q729 | animal | 43,360
Q25308 | Orchidaceae | 22,905
Q756 | plant | 11,244
Q14560 | Cactaceae | 7,840
Q173756 | Apocynaceae | 7,018
Q526228 | Acantharchus pomotis | 4,373
Q36341 | Brown Bear | 1,878
Q80174 | Pan | 1,696
Q19537 | bonobo | 1,684
Q69581 | Siberian tiger | 1,664
Q171497 | sunflower | 1,637
Q504549 | Spur-thighed tortoise | 1,617
Q8202634 | Apis mellifera sahariensis | 1,438
Q41960 | Ailurus fulgens | 1,317
Q2346039 | Thunnus | 1,285
Q719725 | Saccharomyces cerevisiae | 1,244
Properties matched
Total properties used: 162 (Recall these are properties that occur 99% of the times in the taxon subgraph)
Total queries that use these properties: 1,657,324 (0.87% of all queries)
Most of these look like external IDs. Only 31 of these properties are not IDs.
The distribution shows there is 1 property with >1M queries, 7 properties with >100K queries, 14 properties with 2-8K queries, and the rest of the properties match ~1K queries or fewer. In short, the distribution is extremely skewed toward only ~10 properties.
Top properties that cause a query to be related to Taxon subgraph (Q16521)
Property | Property label | # of queries
P3151 | iNaturalist taxon ID | 1,346,905
P141 | IUCN conservation status | 167,512
P183 | endemic to | 152,618
P961 | IPNI plant ID | 64,340
P938 | FishBase species ID | 39,494
P6018 | SeaLifeBase ID | 25,217
P574 | year of taxon publication | 12,599
P2040 | CITES Species ID | 12,073
P697 | ex taxon author | 10,710
P566 | basionym | 8,469
P5473 | The Reptile Database ID | 6,357
P5036 | AmphibiaWeb Species ID | 6,343
P7715 | World Flora Online ID | 4,868
P5626 | Global Invasive Species Database ID | 4,509
P5037 | Plants of the World online ID | 3,629
P6105 | Observation.org ID | 3,193
P960 | Tropicos ID | 2,922
P9157 | Open Tree of Life ID | 2,470
P1070 | PlantList-ID | 2,181
P1772 | USDA PLANTS ID | 2,167
Subject/Object URI matched
Total URIs used: 651,945 (Recall these are URIs that occur 99% of the times in the taxon subgraph)
Total queries that use these URIs: 24,510,707 (12.86% of all queries)
The top URI is in fact the Q-ID of the taxon subgraph, Q16521, and matches 12.19% of all queries. We look into the queries directly later in this section.
We analyze the top 100K URIs. Of these, 66% are Wikidata items, 31% are Wikipedia links.
The distribution shows that the top 2 URIs occur in queries tens of times more often than the other URIs. Of course this data is only for November 2021, but the high usage of the taxon Q-ID was also observed in October 2021 data.
Top URIs that cause a query to be related to Taxon subgraph (Q16521)
The Wikipedia links are from 80 different languages. The tables below show some of the top languages used in terms of unique links queried and shows the top 5 links for each of these languages.
Top language Wikipedias as URIs in taxon subgraph queries
Language: en, # unique links: 13533
URI | # Query
en.wikipedia.org/wiki/Tokay_gecko | 2,602
en.wikipedia.org/wiki/Bobtail_squid | 1,834
en.wikipedia.org/wiki/Coronavirus | 1,421
en.wikipedia.org/wiki/Fusarium | 1,204
en.wikipedia.org/wiki/Tardigrade | 1,202
Language: ja, # unique links: 6611
URI | # Query
ja.wikipedia.org/wiki/キーウィ_(鳥) | 4,827
ja.wikipedia.org/wiki/ガラパゴスリクイグアナ | 2,309
ja.wikipedia.org/wiki/セミ | 2,186
ja.wikipedia.org/wiki/ドードー | 1,932
ja.wikipedia.org/wiki/ダイオウホウズキイカ | 1,864
Language: es, # unique links: 2908
URI | # Query
es.wikipedia.org/wiki/Eudocimus_ruber | 977
es.wikipedia.org/wiki/Chelonioidea | 610
es.wikipedia.org/wiki/Pelecanus | 458
es.wikipedia.org/wiki/Fregata_magnificens | 262
es.wikipedia.org/wiki/Ebolavirus | 255
Language: de, # unique links: 1950
URI | # Query
de.wikipedia.org/wiki/Coronaviridae | 123
de.wikipedia.org/wiki/Taubenschwänzchen | 95
de.wikipedia.org/wiki/Mariendistel | 79
de.wikipedia.org/wiki/Wespenspinne | 75
de.wikipedia.org/wiki/Schnaken | 72
Language: fr, # unique links: 1342
URI | # Query
fr.wikipedia.org/wiki/Homo_sapiens | 201
fr.wikipedia.org/wiki/Coronavirus | 107
fr.wikipedia.org/wiki/Tardigrada | 80
fr.wikipedia.org/wiki/Scutigère_véloce | 76
fr.wikipedia.org/wiki/Muguet_de_mai | 55
Language: nl, # unique links: 1056
URI | # Query
nl.wikipedia.org/wiki/Europese_hoornaar | 105
nl.wikipedia.org/wiki/Vuurwants | 66
nl.wikipedia.org/wiki/Stinkende_kortschildkever | 61
nl.wikipedia.org/wiki/Stadsreus_(zweefvlieg) | 48
nl.wikipedia.org/wiki/Coronavirussen | 47
Language: ru, # unique links: 800
URI | # Query
ru.wikipedia.org/wiki/Мимивирус | 53
ru.wikipedia.org/wiki/Коронавирусы | 50
ru.wikipedia.org/wiki/Обыкновенная_мухоловка | 49
ru.wikipedia.org/wiki/Тихоходки | 40
ru.wikipedia.org/wiki/Малая_панда | 40
Language: pt, # unique links: 520
URI | # Query
pt.wikipedia.org/wiki/Candiru | 39
pt.wikipedia.org/wiki/Tangerina | 36
pt.wikipedia.org/wiki/Tardigrada | 34
pt.wikipedia.org/wiki/Grelo | 33
pt.wikipedia.org/wiki/Panda-vermelho | 31
Query time
The total query time of taxon subgraph is ~3% of total query time and total query count is 14.26% of all queries.
Average time per query is 0.064 seconds (64 ms). Almost all queries in this subgraph are small and simple.
The query time distribution is shown in the chart below: in absolute counts, in percent of queries in taxon subgraph, and in percent of all queries.
User agent
The top user agents that query the taxon subgraph are listed below. This helps us view the distribution of usage: whether a few user agents dominate the usage or whether usage is well distributed across user agents. The top 10 user agents in terms of query count and in terms of query time are shown in the table below. The query type column is discussed later in the section.
Top user agents in taxon subgraph
User agent | Query count | % query in taxon subgraph | % query overall | Query time (hr) | % query time in taxon subgraph | % query time overall | # query type
mix-n-match | 22,959,293 | 84.493 | 12.048 | 163 | 33.949 | 1.063 | 5
Hub | 1,318,251 | 4.851 | 0.692 | 17 | 3.455 | 0.108 | 1
WikidataQueryServiceR | 568,799 | 2.093 | 0.298 | 9 | 1.837 | 0.058 | 34
UA#4 | 325,563 | 1.198 | 0.171 | 10 | 2.044 | 0.064 | 5
UA#5 | 265,565 | 0.977 | 0.139 | 2 | 0.495 | 0.015 | 5
UA#6 | 199,650 | 0.735 | 0.105 | 53 | 11.133 | 0.349 | 24
UA#7 | 168,536 | 0.62 | 0.088 | 2 | 0.441 | 0.014 | 2
sparqlwrapper | 161,781 | 0.595 | 0.085 | 126 | 26.257 | 0.822 | 33
UA#9 | 107,736 | 0.396 | 0.057 | 3 | 0.627 | 0.02 | 1
EasyContent | 103,292 | 0.38 | 0.054 | 2 | 0.346 | 0.011 | 1
UA#11 | 1,330 | 0.005 | 0.001 | 12 | 2.481 | 0.078 | 8
UA#12 | 45,065 | 0.166 | 0.024 | 8 | 1.644 | 0.051 | 9
AhrefsBot | 56,580 | 0.208 | 0.03 | 8 | 1.575 | 0.049 | 12
Apache-Jena-ARQ | 6,654 | 0.024 | 0.003 | 6 | 1.27 | 0.04 | 11
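The query count and query time columns of the taxon user-agent table can be combined into a mean time per query, which makes the difference between high-volume and heavy user agents easier to see. The sketch below uses three rows from that table; since the hours in the table are rounded to whole numbers, the results are only approximate, and the helper name is illustrative.

```python
def mean_seconds_per_query(query_time_hours: float, query_count: int) -> float:
    """Average query time in seconds, from total hours and number of queries."""
    return query_time_hours * 3600 / query_count


# (query count, query time in hours) copied from the taxon user-agent table.
TAXON_USER_AGENTS = {
    "mix-n-match": (22_959_293, 163),
    "sparqlwrapper": (161_781, 126),
    "UA#11": (1_330, 12),
}

if __name__ == "__main__":
    for agent, (count, hours) in TAXON_USER_AGENTS.items():
        print(f"{agent}: {mean_seconds_per_query(hours, count):.3f} s per query")
```

On these figures mix-n-match averages roughly 26 ms per query, while sparqlwrapper averages about 2.8 s and UA#11 more than 30 s, which is why user agents with few queries can still account for a visible share of query time.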
The query time breakdown was plotted for the top 20 user agents (in terms of time).
Query types
Query types are grouped by the operations a query uses and also the order of operations used. This groups similar queries together despite different information sought and also separates groups of simple or complicated queries. The taxon subgraph has ~1,100 different types of queries (much less variety than the ~11.5K query types in the human subgraph). Notice that some query groups can be very similar in what they ask for, although most groups differ a lot. The top query groups are listed below. The top 3 types of queries alone account for ~90% of the queries of the taxon subgraph. The top 40 form 99% of the queries in this subgraph. The rest form a long tail of small query counts.
Looking at the top 5 query groups (that make up >90% of all taxon subgraph queries), the following query types were found:
Search with taxon name, synonyms, altlabels etc
Search with taxon ID. E.g. SELECT DISTINCT ?subject WHERE { ?subject wdt:P3151 '47126' .}
Get labels of certain items
Get labels of certain items in specific languages
Get external IDs of items
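Coverage statements such as "the top 3 query types account for ~90% of queries" can be derived by sorting the per-type counts and accumulating them until the target share is reached. The following is a minimal sketch with made-up counts; the function name is illustrative.

```python
def types_needed_for_coverage(type_counts, target_percent: float) -> int:
    """Smallest number of top query types whose queries reach the target share.

    type_counts: iterable of query counts, one per query type (any order).
    target_percent: share of all queries to cover, e.g. 90 for 90%.
    """
    counts = sorted(type_counts, reverse=True)
    total = sum(counts)
    running = 0
    for rank, count in enumerate(counts, start=1):
        running += count
        if 100 * running >= target_percent * total:
            return rank
    return len(counts)


if __name__ == "__main__":
    # Hypothetical per-type query counts with a long tail.
    sample_counts = [2, 50, 5, 30, 3, 10]
    for target in (60, 90, 99):
        print(target, types_needed_for_coverage(sample_counts, target))
```

Running this for a few targets gives a compact description of how long the tail of rare query types is.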
UA vs query types
Getting the number of query types per user agent informs us of the variety of queries a user agent makes to WDQS. This also breaks down the taxon subgraph queries into finer groups. The following plot shows the number of query types for each user agent in the taxon subgraph.
This shows us that most user agents make only 1 type of query. Only 3 user agents make queries of >100 types, and 5 user agents make queries of 50-100 types. The figure below shows the number of queries per query type for the top 8 user agents.
** The number of query types for the top user agents in terms of query count and time is listed in the Taxon User agent section.
Query type vs time class
There are close to 1,100 query types, and the top 3 types alone account for ~90% of the queries of the taxon subgraph (12.8% of all queries). Not all of them are equally time consuming: some are simple queries, while others are long and complex. The following plot shows the top 10 query types with query time classes. The values above the bars show both the percentage within the taxon subgraph and the overall query percentage. The subplots are titled with the percentage of queries in that query type.
In sum, most queries in the taxon subgraph are small and simple, take between 10 and 100 ms to run, and mostly come from 1 or 2 user agents.
Services
The queries use 12 unique services. The top 4 services are the most used, although their usage is still pretty low; the rest are used in fewer than 30 queries.
Some query type analysis done in section query types gives us a good idea of what kind of queries the taxon subgraph receives. Looking at the triples themselves also helps us peek into what most of the queries look like, and what the most common subjects, objects, and properties are. The table below lists these along with the top Wikidata items and properties used overall.
Top triples in taxon subgraph queries
Subject | Predicate | Object | # query | % query of taxon subgraph
q | wdt:P31/(wdt:P279)* | wd:Q16521 | 22,956,806 | 84.484*
bd:serviceParam | wikibase:language | en | 379,833 | 1.398
item | wdt:P225 | taxonName | 265,426 | 0.977
item | p:P105 | taxonRank1 | 265,360 | 0.977
taxonRank1 | ps:P105 | taxonRank | 265,360 | 0.977
item | rdfs:label | label | 241,505 | 0.889
item | skos:altLabel | altLabel | 236,495 | 0.870
bd:serviceParam | wikibase:language | [AUTO_LANGUAGE],en | 205,404 | 0.756
items | rdfs:label | itemlabel | 199,441 | 0.734
items | (wdt:P279)? | types | 195,916 | 0.721
*This coincides with the number of queries from mix-n-match. Almost all mix-n-match queries in the taxon subgraph have this triple, and this triple occurs only in mix-n-match queries.
Paths are more complex predicates that chain properties with logic. Complex paths can increase the scope of a query and also increase its runtime. The table below lists the most used paths in taxon subgraph queries. While most paths are not very complex or long, there is a lot of variety in the ways paths are formed to perform queries. Ordinary properties are not considered paths. The following list contains not only the paths, but also their breakdown into component paths (as done by Jena ARQ while parsing SPARQL queries). For instance, (p:P31/ps:P31)/(wdt:P279)* is recorded as:
(p:P31/ps:P31)/(wdt:P279)*
(p:P31/ps:P31)
p:P31
ps:P31
(wdt:P279)*
wdt:P279
Unit forms, such as wdt:P279, were removed from the path list since they are parts of other paths and not paths themselves. Other entries that were obviously part of a longer path, and not paths themselves, were also removed from the list to better visualize the distinct paths used in the queries.