Hyphe relies on a JSON-RPC API that can be controlled easily through the web interface or called directly from a JSON-RPC client.
Note: because it relies on the JSON-RPC protocol, the API methods are not easy to test from a browser (arguments have to be sent through POST), but you can test them directly from the command line using the dedicated tools; see the Developers' documentation.
The current JSON-RPC 1.0 implementation requires arguments to be provided as an ordered array of the method's arguments. Calling with named arguments is possible but not well handled, and is not recommended until we migrate to REST.
The API will always answer as such:
- Success:
{
"code": "success",
"result": "<The actual expected result, possibly an object, an array, a number, a string, ...>"
}
- Error:
{
"code": "fail",
"message": "<A string describing the possible cause of the error.>"
}
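As a minimal sketch of the call convention and answer envelope described above, the following Python helpers build a JSON-RPC 1.0 request body with positional arguments and unwrap an answer. The names `build_request`, `unwrap` and `HypheError` are illustrative and not part of Hyphe; how the envelope is nested inside the transport-level response may vary, so check it against your own instance.

```python
import json


class HypheError(Exception):
    """Raised when the API answers with code "fail" (hypothetical helper)."""


def build_request(method, params=(), request_id=1):
    """Serialize a JSON-RPC 1.0 call with positional (ordered) arguments."""
    return json.dumps({"method": method, "params": list(params), "id": request_id})


def unwrap(answer):
    """Return the "result" of a success answer, or raise on a "fail" answer."""
    if answer.get("code") == "success":
        return answer.get("result")
    raise HypheError(answer.get("message", "unknown error"))


# Request body for get_status on the default corpus.
body = build_request("get_status", ["--hyphe--"])
print(body)
```

The body would then be sent with an HTTP POST to the API endpoint of your Hyphe instance.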
- Default API commands (no namespace)
- CORPUS HANDLING
test_corpus
list_corpus
get_corpus_options
set_corpus_options
create_corpus
start_corpus
stop_corpus
get_corpus_tlds
backup_corpus
ping
reinitialize
destroy_corpus
clear_all
- CORE & CORPUS STATUS
get_status
- BASIC PAGE DECLARATION (AND WEBENTITY CREATION)
declare_page
declare_pages
- BASIC CRAWL METHODS
listjobs
propose_webentity_startpages
crawl_webentity
crawl_webentity_with_startmode
get_webentity_jobs
get_webentity_logs
- HTTP LOOKUP METHODS
lookup_httpstatus
lookup
- Commands for namespace: "crawl."
deploy_crawler
delete_crawler
cancel_all
start
cancel
get_job_logs
- Commands for namespace: "store."
- DEFINE WEBENTITIES
get_lru_definedprefixes
declare_webentity_by_lruprefix_as_url
declare_webentity_by_lru
declare_webentity_by_lrus_as_urls
declare_webentity_by_lrus
- EDIT WEBENTITIES
basic_edit_webentity
rename_webentity
change_webentity_id
set_webentity_status
set_webentities_status
set_webentity_homepage
add_webentity_lruprefixes
rm_webentity_lruprefix
add_webentity_startpage
rm_webentity_startpage
merge_webentity_into_another
merge_webentities_into_another
delete_webentity
- RETRIEVE & SEARCH WEBENTITIES
get_webentity
get_webentity_by_lruprefix
get_webentity_by_lruprefix_as_url
get_webentity_for_url
get_webentity_for_url_as_lru
get_webentities
advanced_search_webentities
exact_search_webentities
prefixed_search_webentities
postfixed_search_webentities
free_search_webentities
get_webentities_by_status
get_webentities_by_name
get_webentities_by_tag_value
get_webentities_by_tag_category
get_webentities_by_user_tag
get_webentities_mistagged
get_webentities_uncrawled
get_webentities_page
get_webentities_ranking_stats
- TAGS
add_webentity_tag_value
add_webentities_tag_value
rm_webentity_tag_key
rm_webentity_tag_value
set_webentity_tag_values
get_tags
get_tag_namespaces
get_tag_categories
get_tag_values
- PAGES, LINKS & NETWORKS
get_webentity_pages
get_webentity_mostlinked_pages
get_webentity_subwebentities
get_webentity_parentwebentities
get_webentity_nodelinks_network
get_webentities_network
- CREATION RULES
get_default_webentity_creationrule
get_webentity_creationrules
delete_webentity_creationrule
add_webentity_creationrule
simulate_creationrules_for_urls
simulate_creationrules_for_lrus
- PRECISION EXCEPTIONS
get_precision_exceptions
delete_precision_exceptions
add_precision_exception
- VARIOUS
trigger_links_build
trigger_links_reset
get_webentities_stats
test_corpus
:
corpus
(optional, default:"--hyphe--"
)
Returns the current status of a corpus
: "ready"/"starting"/"stopped"/"error".
list_corpus
:
Returns the list of all existing corpora with metas.
get_corpus_options
:
corpus
(optional, default:"--hyphe--"
)
Returns detailed settings of a corpus
.
set_corpus_options
:
corpus
(optional, default:"--hyphe--"
)options
(optional, default:null
)
Updates the settings of a corpus
according to the keys/values provided in options
as a json object respecting the settings schema visible by querying get_corpus_options
. Returns the detailed settings.
create_corpus
:
name
(optional, default:"--hyphe--"
)password
(optional, default:""
)options
(optional, default:{}
)
Creates a corpus with the chosen name
and optional password
and options
(as a json object see set/get_corpus_options
). Returns the corpus generated id and status.
start_corpus
:
corpus
(optional, default:"--hyphe--"
)password
(optional, default:""
)
Starts an existing corpus
possibly password
-protected. Returns the new corpus status.
stop_corpus
:
corpus
(optional, default:"--hyphe--"
)
Stops an existing and running corpus
. Returns the new corpus status.
get_corpus_tlds
:
corpus
(optional, default:"--hyphe--"
)
Returns the lists of TLDs rules and exceptions built from Mozilla's list at the creation of corpus
.
backup_corpus
:
corpus
(optional, default:"--hyphe--"
)
Saves locally on the server in the archive directory a timestamped backup of corpus
including 4 json backup files of all webentities/links/crawls and corpus options.
ping
:
corpus
(optional, default:null
)timeout
(optional, default:3
)
Tests during timeout
seconds whether an existing corpus
is started. Returns "pong" on success or the corpus status otherwise.
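The start-then-ping sequence can be sketched as follows. This is only an illustration: `call(method, params)` is a hypothetical caller-supplied function that sends the request and returns the unwrapped result, and the retry policy (`attempts`, `pause`) is an assumption rather than anything Hyphe prescribes.

```python
import time


def wait_until_started(call, corpus, attempts=5, timeout=3, pause=0.0):
    """Start a corpus, then ping it until it answers "pong".

    `call(method, params)` is a hypothetical function sending the JSON-RPC
    request and returning the unwrapped "result".
    """
    call("start_corpus", [corpus, ""])  # positional: corpus, password
    for _ in range(attempts):
        # ping returns "pong" on success, the corpus status otherwise
        if call("ping", [corpus, timeout]) == "pong":
            return True
        time.sleep(pause)
    return False


# Demonstration with a stub standing in for a real server.
_answers = iter(["starting", "starting", "pong"])
print(wait_until_started(lambda method, params: next(_answers), "my-corpus"))
```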
reinitialize
:
corpus
(optional, default:"--hyphe--"
)
Resets completely a corpus
by cancelling all crawls and emptying the MemoryStructure and Mongo data.
destroy_corpus
:
corpus
(optional, default:"--hyphe--"
)
Resets a corpus
then permanently deletes anything associated with it.
clear_all
:
except_corpus_ids
(optional, default:[]
)
Resets Hyphe completely: starts then resets and destroys all existing corpora one by one except for those whose ID is given in except_corpus_ids
.
get_status
:
corpus
(optional, default:"--hyphe--"
)
Returns global metadata on Hyphe's status and specific information on a corpus
.
declare_page
:
url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Indexes a url
into a corpus
. Returns the (newly created or not) associated WebEntity.
declare_pages
:
list_urls
(mandatory)corpus
(optional, default:"--hyphe--"
)
Indexes a bunch of urls given as an array in list_urls
into a corpus
. Returns the (newly created or not) associated WebEntities.
listjobs
:
list_ids
(optional, default:null
)from_ts
(optional, default:null
)to_ts
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns the list and details of all "finished"/"running"/"pending" crawl jobs of a corpus
. Optionally returns only the jobs whose id is given in an array of list_ids
and/or that was created after timestamp from_ts
or before to_ts
.
propose_webentity_startpages
:
webentity_id
(mandatory)startmode
(optional, default:"default"
)categories
(optional, default:false
)save_startpages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns a list of suggested startpages to crawl an existing WebEntity defined by its webentity_id, using the "default" startmode defined for the corpus, or one or an array of either the WebEntity's preset "startpages", "homepage" or "prefixes", or most seen "pages-". Returns them categorised by type of source if categories is set to true. Saves them into the WebEntity if save_startpages is true.
crawl_webentity
:
webentity_id
(mandatory)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)status
(optional, default:"IN"
)phantom_timeouts
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Schedules a crawl for a corpus
for an existing WebEntity defined by its webentity_id
with a specific crawl depth [int]
.
Optionally use PhantomJS by setting phantom_crawl
to "true" and adjust specific phantom_timeouts
as a json object with possible keys timeout
/ajax_timeout
/idle_timeout
.
Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status
("undecided"/"out"/"discovered").
Will use the WebEntity's startpages if it has any or use otherwise the corpus
' "default" startmode
heuristic as defined in propose_webentity_startpages
(use crawl_webentity_with_startmode
to apply a different heuristic, see details in propose_webentity_startpages
).
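Because arguments must be passed as an ordered array, a small helper can map keyword arguments onto the documented order and fill in the defaults. The sketch below copies the order of the crawl_webentity signature above; `positional_params`, `MANDATORY` and the example WebEntity id are hypothetical names chosen for illustration.

```python
import copy

MANDATORY = object()  # sentinel marking arguments that have no default

# Argument order and defaults copied from the crawl_webentity signature.
CRAWL_WEBENTITY_ARGS = [
    ("webentity_id", MANDATORY),
    ("depth", 0),
    ("phantom_crawl", False),
    ("status", "IN"),
    ("phantom_timeouts", {}),
    ("corpus", "--hyphe--"),
]


def positional_params(signature, **given):
    """Turn keyword arguments into the ordered array the API expects."""
    unknown = set(given) - {name for name, _ in signature}
    if unknown:
        raise TypeError("unknown argument(s): " + ", ".join(sorted(unknown)))
    params = []
    for name, default in signature:
        value = given.get(name, default)
        if value is MANDATORY:
            raise TypeError("missing mandatory argument: " + name)
        # Copy so that callers never share the mutable defaults.
        params.append(copy.deepcopy(value))
    return params


print(positional_params(CRAWL_WEBENTITY_ARGS,
                        webentity_id="abc", depth=2, corpus="my-corpus"))
```

The same helper works for any other method once its signature is written down in the same way.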
crawl_webentity_with_startmode
:
webentity_id
(mandatory)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)status
(optional, default:"IN"
)startmode
(optional, default:"default"
)phantom_timeouts
(optional, default:{}
)corpus
(optional, default:"--hyphe--"
)
Schedules a crawl for a corpus
for an existing WebEntity defined by its webentity_id
with a specific crawl depth [int]
.
Optionally use PhantomJS by setting phantom_crawl
to "true" and adjust specific phantom_timeouts
as a json object with possible keys timeout
/ajax_timeout
/idle_timeout
.
Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status
("undecided"/"out"/"discovered").
Optionally set a startmode strategy different from the corpus' "default" one (see details in propose_webentity_startpages).
get_webentity_jobs
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the crawl jobs that have run for a specific WebEntity defined by its webentity_id
.
get_webentity_logs
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
crawl activity logs on a specific WebEntity defined by its webentity_id
.
lookup_httpstatus
:
url
(mandatory)timeout
(optional, default:30
)corpus
(optional, default:"--hyphe--"
)
Tests a url
for timeout
seconds using a corpus
specific connection (possible proxy for instance). Returns the url's HTTP code.
lookup
:
url
(mandatory)timeout
(optional, default:30
)corpus
(optional, default:"--hyphe--"
)
Tests a url
for timeout
seconds using a corpus
specific connection (possible proxy for instance). Returns a boolean indicating whether lookup_httpstatus
returned HTTP code 200 or a redirection code (301/302/...).
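The relation between the two lookup methods can be expressed as a one-line rule. The sketch below is only an illustration of the sentence above, assuming that every 3xx code counts as a redirection; `is_reachable` is a hypothetical name, not a Hyphe function.

```python
def is_reachable(http_status):
    """Mirror the documented rule: True for 200 or a redirection code.

    Assumption: any 3xx status is treated as a redirection.
    """
    return http_status == 200 or 300 <= http_status < 400


for code in (200, 301, 302, 404, 500):
    print(code, is_reachable(code))
```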
deploy_crawler
:
corpus
(optional, default:"--hyphe--"
)
Prepares and deploys on the ScrapyD server a spider (crawler) for a corpus
.
delete_crawler
:
corpus
(optional, default:"--hyphe--"
)
Removes from the ScrapyD server an existing spider (crawler) for a corpus
.
cancel_all
:
corpus
(optional, default:"--hyphe--"
)
Stops all "running" and "pending" crawl jobs for a corpus
.
Cancels all current crawl jobs running or planned for a corpus
and empties the related Mongo data.
start
:
webentity_id
(mandatory)starts
(mandatory)follow_prefixes
(mandatory)nofollow_prefixes
(mandatory)follow_redirects
(optional, default:null
)depth
(optional, default:0
)phantom_crawl
(optional, default:false
)phantom_timeouts
(optional, default:{}
)download_delay
(optional, default:1
)corpus
(optional, default:"--hyphe--"
)
Starts a crawl for a corpus defining finely the crawl options (mainly for debug purposes):
- a webentity_id associated with the crawl
- a list of starts urls to start from
- a list of follow_prefixes to know which links to follow
- a list of nofollow_prefixes to know which links to avoid
- a depth corresponding to the maximum number of clicks done from the start pages
- phantom_crawl set to "true" to use PhantomJS for this crawl and optional phantom_timeouts as an object with keys among timeout/ajax_timeout/idle_timeout
- a download_delay corresponding to the time in seconds spent between two requests by the crawler.
cancel
:
job_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Cancels a crawl of id job_id
for a corpus
.
get_job_logs
:
job_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
activity logs of a specific crawl with id job_id
.
get_lru_definedprefixes
:
lru
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all possible LRU prefixes shorter than lru
and already attached to WebEntities.
declare_webentity_by_lruprefix_as_url
:
url
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startPages
(optional, default:[]
)lruVariations
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus a WebEntity defined for the LRU prefix given as a url and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.
declare_webentity_by_lru
:
lru_prefix
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startPages
(optional, default:[]
)lruVariations
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus a WebEntity defined for a lru_prefix and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.
declare_webentity_by_lrus_as_urls
:
list_urls
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startPages
(optional, default:[]
)lruVariations
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus a WebEntity defined for a set of LRU prefixes given as URLs under list_urls and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.
declare_webentity_by_lrus
:
list_lrus
(mandatory)name
(optional, default:null
)status
(optional, default:null
)startPages
(optional, default:[]
)lruVariations
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Creates for a corpus a WebEntity defined for a set of LRU prefixes given as list_lrus and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.
basic_edit_webentity
:
webentity_id
(mandatory)name
(optional, default:null
)status
(optional, default:null
)homepage
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
at once the name
, status
and homepage
of a WebEntity defined by webentity_id
.
rename_webentity
:
webentity_id
(mandatory)new_name
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the name of a WebEntity defined by webentity_id
to new_name
.
change_webentity_id
:
webentity_old_id
(mandatory)webentity_new_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the id of a WebEntity defined by webentity_old_id
to webentity_new_id
(mainly for advanced debug use).
set_webentity_status
:
webentity_id
(mandatory)status
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the status of a WebEntity defined by webentity_id
to status
(one of "in"/"out"/"undecided"/"discovered").
set_webentities_status
:
webentity_ids
(mandatory)status
(mandatory)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the status of a set of WebEntities defined by a list of webentity_ids
to status
(one of "in"/"out"/"undecided"/"discovered").
set_webentity_homepage
:
webentity_id
(mandatory)homepage
(optional, default:""
)corpus
(optional, default:"--hyphe--"
)
Changes for a corpus
the homepage of a WebEntity defined by webentity_id
to homepage
.
add_webentity_lruprefixes
:
webentity_id
(mandatory)lru_prefixes
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a list of lru_prefixes
(or a single one) to a WebEntity defined by webentity_id
.
rm_webentity_lruprefix
:
webentity_id
(mandatory)lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus a lru_prefix from the list of prefixes of a WebEntity defined by webentity_id. Will delete the WebEntity if it ends up with no LRU prefix left.
add_webentity_startpage
:
webentity_id
(mandatory)startpage_url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus a startpage_url to the list of startpages of a WebEntity defined by webentity_id.
rm_webentity_startpage
:
webentity_id
(mandatory)startpage_url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus a startpage_url from the list of startpages of a WebEntity defined by webentity_id.
merge_webentity_into_another
:
old_webentity_id
(mandatory)good_webentity_id
(mandatory)include_tags
(optional, default:false
)include_home_and_startpages_as_startpages
(optional, default:false
)include_name_and_status
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Assembles for a corpus
2 WebEntities by deleting WebEntity defined by old_webentity_id
and adding all of its LRU prefixes to the one defined by good_webentity_id
. Optionally set include_tags
and/or include_home_and_startpages_as_startpages
and/or include_name_and_status
to "true" to also add the tags and/or startpages and/or name&status to the merged resulting WebEntity.
merge_webentities_into_another
:
old_webentity_ids
(mandatory)good_webentity_id
(mandatory)include_tags
(optional, default:false
)include_home_and_startpages_as_startpages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Assembles for a corpus
a bunch of WebEntities by deleting WebEntities defined by a list of old_webentity_ids
and adding all of their LRU prefixes to the one defined by good_webentity_id
. Optionally set include_tags
and/or include_home_and_startpages_as_startpages
to "true" to also add the tags and/or startpages to the merged resulting WebEntity.
delete_webentity
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes from a corpus
a WebEntity defined by webentity_id
(mainly for advanced debug use).
get_webentity
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a WebEntity defined by its webentity_id
.
get_webentity_by_lruprefix
:
lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity having lru_prefix
as one of its LRU prefixes.
get_webentity_by_lruprefix_as_url
:
url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity having one of its LRU prefixes corresponding to the LRU given under the form of a url
.
get_webentity_for_url
:
url
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity to which a url
belongs (meaning starting with one of the WebEntity's prefix and not another).
get_webentity_for_url_as_lru
:
lru
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the WebEntity to which a url given under the form of a lru
belongs (meaning starting with one of the WebEntity's prefix and not another).
get_webentities
:
list_ids
(optional, default:[]
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)light
(optional, default:false
)semilight
(optional, default:false
)light_for_csv
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all existing WebEntities or only the WebEntities whose id is among list_ids. Results are paginated, with count the number of results returned per page and page the number of the desired page of results. Returns all results at once if list_ids is provided or count = -1; otherwise results include metadata on the request, including the total number of results and a token to be reused to collect the other pages via get_webentities_page.
Other possible options include:
- order the results with sort by inputting a field or list of fields as named in the WebEntities returned objects; optionally prefix a sort field with a "-" to reverse the sorting on it; for instance ["-indegree", "name"] will order by maximum indegree first then by alphabetic order of names
- set light or semilight or light_for_csv to "true" to collect lighter data with fewer WebEntities fields.
advanced_search_webentities
:
allFieldsKeywords
(optional, default:[]
)fieldKeywords
(optional, default:[]
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)autoescape_query
(optional, default:true
)light
(optional, default:false
)semilight
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities matching a specific search using the allFieldsKeywords
and fieldKeywords
arguments. Searched keywords will automatically be escaped: set autoescape_query
to "false" to allow input of special Lucene queries.
Returns all results at once if count = -1; otherwise results are paginated, with count the number of results returned per page and page the number of the desired page of results. Results include metadata on the request, including the total number of results and a token to be reused to collect the other pages via get_webentities_page.
- allFieldsKeywords should be a string or list of strings to search in all textual fields of the WebEntities ("name"/"status"/"lruset"/"startpages"/...). For instance: ["hyphe", "www"]
- fieldKeywords should be a list of 2-element arrays giving first the field to search into then the searched value, or optionally for the field "indegree" an array of minimum and maximum values to search into. For instance: [["name", "hyphe"], ["indegree", [3, 1000]]]
- see description of sort, light and semilight in get_webentities above.
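To illustrate how the two keyword arguments combine, here is a hedged sketch that assembles the positional array for advanced_search_webentities in the order of the signature above. The helper name `advanced_search_params` and its validation are assumptions made for the example; only the argument order, defaults and sample values come from this page.

```python
def advanced_search_params(all_fields_keywords=(), field_keywords=(), sort=None,
                           count=100, page=0, corpus="--hyphe--"):
    """Build the ordered argument array for advanced_search_webentities."""
    pairs = []
    for pair in field_keywords:
        # Each entry must be [field, value] (or [field, [min, max]] for indegree).
        if len(pair) != 2:
            raise ValueError("fieldKeywords entries must have 2 elements")
        pairs.append(list(pair))
    # Order: allFieldsKeywords, fieldKeywords, sort, count, page,
    # autoescape_query, light, semilight, corpus (defaults kept for the flags).
    return [list(all_fields_keywords), pairs, sort, count, page,
            True, False, True, corpus]


params = advanced_search_params(
    ["hyphe", "www"],
    [["name", "hyphe"], ["indegree", [3, 1000]]],
    sort=["-indegree", "name"],
)
print(params)
```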
exact_search_webentities
:
query
(mandatory)field
(optional, default:null
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having one textual field or optional specific field
exactly equal to the value given as query
. Searched query will automatically be escaped of Lucene special characters.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
prefixed_search_webentities
:
query
(mandatory)field
(optional, default:null
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having one textual field or optional specific field
beginning with the value given as query
. Searched query will automatically be escaped of Lucene special characters.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
postfixed_search_webentities
:
query
(mandatory)field
(optional, default:null
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having one textual field or optional specific field
finishing with the value given as query
. Searched query will automatically be escaped of Lucene special characters.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
free_search_webentities
:
query
(mandatory)field
(optional, default:null
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having one textual field or optional specific field
containing the value given as query
. Searched query will automatically be escaped of Lucene special characters.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_by_status
:
status
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having their status equal to status
(one of "in"/"out"/"undecided"/"discovered").
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_by_name
:
name
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having their name equal to name
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_by_tag_value
:
value
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having at least one tag in any namespace/category equal to value
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_by_tag_category
:
category
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having at least one tag in a specific category
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_by_user_tag
:
category
(mandatory)value
(mandatory)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities having at least one tag in any category of the namespace "USER" equal to value
.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_mistagged
:
status
(optional, default:'IN'
)missing_a_category
(optional, default:false
)multiple_values
(optional, default:false
)sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all WebEntities of status status
with no tag of the namespace "USER" or multiple tags for some USER categories if multiple_values
is true or no tag for at least one existing USER category if missing_a_category
is true.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_uncrawled
:
sort
(optional, default:null
)count
(optional, default:100
)page
(optional, default:0
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all IN WebEntities which have no crawl job associated with them.
Results are paginated and will include a token
to be reused to collect the other pages via get_webentities_page
: see advanced_search_webentities
for explanations on sort
count
and page
.
get_webentities_page
:
pagination_token
(mandatory)n_page
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the page number n_page
of WebEntities corresponding to the results of a previous query run using any of the get_webentities
or search_webentities
methods using the returned pagination_token
.
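Collecting every page of a query could look like the sketch below. The field names "webentities" and "next_page" in a paginated result are assumptions not confirmed by this page (only the token is documented), and `call(method, params)` is a hypothetical function that sends the request and returns the unwrapped result.

```python
def iter_webentities(call, method, params, corpus="--hyphe--"):
    """Yield every WebEntity of a paginated query, page after page.

    Assumed result fields: "webentities" (list), "token" and "next_page"
    (None on the last page). A plain list is treated as a complete result.
    """
    result = call(method, params)
    if isinstance(result, list):  # all results returned at once
        for webentity in result:
            yield webentity
        return
    token = result.get("token")
    while True:
        for webentity in result["webentities"]:
            yield webentity
        next_page = result.get("next_page")
        if token is None or next_page is None:
            return
        # positional: pagination_token, n_page, corpus
        result = call("get_webentities_page", [token, next_page, corpus])


# Demonstration with a stub serving two pages.
_pages = [
    {"webentities": ["a", "b"], "token": "tok", "next_page": 1},
    {"webentities": ["c"], "token": "tok", "next_page": None},
]
print(list(iter_webentities(lambda m, p: _pages[p[1]] if m == "get_webentities_page" else _pages[0],
                            "get_webentities", [[], None, 2, 0])))
```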
get_webentities_ranking_stats
:
pagination_token
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
histogram data on the indegrees of all WebEntities matching a previous query run using any of the get_webentities
or search_webentities
methods using the returned pagination_token
.
add_webentity_tag_value
:
webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a tag namespace:category=value to a WebEntity defined by webentity_id.
add_webentities_tag_value
:
webentity_ids
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds for a corpus
a tag namespace:category=value to a bunch of WebEntities defined by a list of webentity_ids.
rm_webentity_tag_key
:
webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
all tags within namespace:category
associated with a WebEntity defined by webentity_id
if it is set.
rm_webentity_tag_value
:
webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)value
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes for a corpus
a tag namespace:category=value associated with a WebEntity defined by webentity_id if it is set.
set_webentity_tag_values
:
webentity_id
(mandatory)namespace
(mandatory)category
(mandatory)values
(mandatory)corpus
(optional, default:"--hyphe--"
)
Replaces for a corpus
all existing tags of a WebEntity defined by webentity_id
for a specific namespace
and category
by a list of values
or a single tag.
get_tags
:
namespace
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a tree of all existing tags of the webentities hierarchised by namespaces and categories. Optionally limits to a specific namespace
.
get_tag_namespaces
:
corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing namespaces of the webentities tags.
get_tag_categories
:
namespace
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing categories of the webentities tags. Optionally limits to a specific namespace
.
get_tag_values
:
namespace
(optional, default:null
)category
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
a list of all existing values in the webentities tags. Optionally limits to a specific namespace
and/or category
.
- PAGES, LINKS & NETWORKS
get_webentity_pages
:
webentity_id
(mandatory)onlyCrawled
(optional, default:true
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all indexed Pages fitting within the WebEntity defined by webentity_id
. Optionally limits the results to Pages which were actually crawled by setting onlyCrawled
to "true".
get_webentity_mostlinked_pages
:
webentity_id
(mandatory)npages
(optional, default:20
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the npages
(defaults to 20) most linked Pages indexed that fit within the WebEntity defined by webentity_id
.
get_webentity_subwebentities
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all sub-webentities of a WebEntity defined by webentity_id
(meaning webentities having at least one LRU prefix starting with one of the WebEntity's prefixes).
get_webentity_parentwebentities
:
webentity_id
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all parent-webentities of a WebEntity defined by webentity_id
(meaning webentities having at least one LRU prefix starting like one of the WebEntity's prefixes).
get_webentity_nodelinks_network
:
webentity_id
(optional, default:null
)include_external_links
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the list of all internal NodeLinks of a WebEntity defined by webentity_id
. Optionally add external NodeLinks (the frontier) by setting include_external_links
to "true".
get_webentities_network
:
corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the list of all aggregated weighted links between WebEntities.
get_default_webentity_creationrule
:
corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the default WebEntityCreationRule.
get_webentity_creationrules
:
lru_prefix
(optional, default:null
)corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
all existing WebEntityCreationRules or only one set for a specific lru_prefix
.
delete_webentity_creationrule
:
lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes from a corpus
an existing WebEntityCreationRule set for a specific lru_prefix
.
add_webentity_creationrule
:
lru_prefix
(mandatory)regexp
(mandatory)apply_to_existing_pages
(optional, default:false
)corpus
(optional, default:"--hyphe--"
)
Adds to a corpus
a new WebEntityCreationRule set for a lru_prefix
to a specific regexp
or one of "subdomain"/"subdomain-N"/"domain"/"path-N"/"prefix N"/"page" N being an integer. Optionally set apply_to_existing_pages
to "true" to apply it immediately to past crawls.
simulate_creationrules_for_urls
:
pageURLs
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns an object giving for each URL of pageURLs
(single string or array) the prefix of the theoretical WebEntity the URL would be attached to within a corpus
following its specific WebEntityCreationRules.
simulate_creationrules_for_lrus
:
pageLRUs
(mandatory)corpus
(optional, default:"--hyphe--"
)
Returns an object giving for each LRU of pageLRUs
(single string or array) the prefix of the theoretical WebEntity the LRU would be attached to within a corpus
following its specific WebEntityCreationRules.
get_precision_exceptions
:
corpus
(optional, default:"--hyphe--"
)
Returns for a corpus
the list of all existing PrecisionExceptions.
delete_precision_exceptions
:
list_lru_exceptions
(mandatory)corpus
(optional, default:"--hyphe--"
)
Removes from a corpus
a set of existing PrecisionExceptions listed as list_lru_exceptions
.
add_precision_exception
:
lru_prefix
(mandatory)corpus
(optional, default:"--hyphe--"
)
Adds to a corpus
a new PrecisionException for lru_prefix
.
trigger_links_build
:
corpus
(optional, default:"--hyphe--"
)
Will initiate a links calculation update (useful especially when a corpus crashed during the links calculation and no more crawls are scheduled).
trigger_links_reset
:
corpus
(optional, default:"--hyphe--"
)
Will initiate a whole reset and regeneration of all WebEntityLinks of a corpus
. Can take a while.
get_webentities_stats
:
corpus
(optional, default:"--hyphe--"
)
Returns for a corpus a set of statistics on the distribution of WebEntities by status, every 5 minutes.