API documentation

Hyphe relies on a JsonRPC API that can be controlled easily through the web interface or called directly from a JsonRPC client.

Note: because it relies on the JSON-RPC protocol, the API is not easy to test from a browser (arguments have to be sent through POST), but you can test it directly from the command line using the dedicated tools; see the Developers' documentation.

Data & Query format

The current JSON-RPC 1.0 implementation requires arguments to be provided as an ordered array of the method's arguments. Calling with named arguments is possible but not well handled, and is not recommended until we migrate to REST.
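Because arguments are positional, setting an optional argument that comes late in a signature (typically corpus) means filling in every argument before it. The following is a minimal sketch of a helper that does this; the `positional` function and the `MANDATORY` sentinel are hypothetical names introduced for illustration, and the signature is transcribed from the crawl_webentity entry of this page.

```python
# Sentinel marking arguments that have no default value (hypothetical helper).
MANDATORY = object()

# Signature transcribed from the crawl_webentity entry of this page.
# Order matters: it is the order of the positional array.
CRAWL_WEBENTITY = [
    ("webentity_id", MANDATORY),
    ("depth", 0),
    ("phantom_crawl", False),
    ("status", "IN"),
    ("phantom_timeouts", {}),
    ("corpus", "--hyphe--"),
]


def positional(signature, **given):
    """Turn named values into the ordered array expected by the API."""
    unknown = set(given) - {name for name, _ in signature}
    if unknown:
        raise TypeError("unknown argument(s): %s" % ", ".join(sorted(unknown)))
    params = []
    for name, default in signature:
        value = given.get(name, default)
        if value is MANDATORY:
            raise TypeError("missing mandatory argument: %s" % name)
        params.append(value)
    return params


# The webentity id 12 and the corpus name are made-up values.
params = positional(CRAWL_WEBENTITY, webentity_id=12, corpus="my-corpus")
```

Here `params` holds all six values in order, with the documented defaults filling the gap between webentity_id and corpus.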

The API will always answer as such:

  • Success:
{
  "code": "success",
  "result": "<The actual expected result, possibly an object, an array, a number, a string, ...>"
}
  • Error:
{
  "code": "fail",
  "message": "<A string describing the possible cause of the error.>"
}
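
As an illustration of the two answer shapes above, here is a small sketch that serialises a JSON-RPC 1.0 request and unwraps an answer object. `build_request`, `unwrap` and `HypheError` are hypothetical names, not part of Hyphe; how the answer object is carried inside the JSON-RPC response envelope is not described on this page, so `unwrap` simply takes the object shown above.

```python
import json
from itertools import count

_request_ids = count(1)


class HypheError(Exception):
    """Raised when the API answers with code "fail" (hypothetical helper)."""


def build_request(method, *params):
    """Serialise a JSON-RPC 1.0 call with positional arguments."""
    return json.dumps({
        "method": method,
        "params": list(params),
        "id": next(_request_ids),
    })


def unwrap(answer):
    """Return the "result" of a successful answer, raise on failure."""
    if answer.get("code") == "success":
        return answer["result"]
    raise HypheError(answer.get("message", "unknown error"))
```

The string returned by `build_request("ping", "--hyphe--", 3)` would be sent as the POST body; the endpoint URL depends on your installation.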

Summary

  • Default API commands (no namespace)
  • Commands for namespace: "crawl."
    • deploy_crawler
    • delete_crawler
    • cancel_all
    • start
    • cancel
    • get_job_logs
  • Commands for namespace: "store."
    • DEFINE WEBENTITIES
      • get_lru_definedprefixes
      • declare_webentity_by_lruprefix_as_url
      • declare_webentity_by_lru
      • declare_webentity_by_lrus_as_urls
      • declare_webentity_by_lrus
    • EDIT WEBENTITIES
      • basic_edit_webentity
      • rename_webentity
      • change_webentity_id
      • set_webentity_status
      • set_webentities_status
      • set_webentity_homepage
      • add_webentity_lruprefixes
      • rm_webentity_lruprefix
      • add_webentity_startpage
      • rm_webentity_startpage
      • merge_webentity_into_another
      • merge_webentities_into_another
      • delete_webentity
    • RETRIEVE & SEARCH WEBENTITIES
      • get_webentity
      • get_webentity_by_lruprefix
      • get_webentity_by_lruprefix_as_url
      • get_webentity_for_url
      • get_webentity_for_url_as_lru
      • get_webentities
      • advanced_search_webentities
      • exact_search_webentities
      • prefixed_search_webentities
      • postfixed_search_webentities
      • free_search_webentities
      • get_webentities_by_status
      • get_webentities_by_name
      • get_webentities_by_tag_value
      • get_webentities_by_tag_category
      • get_webentities_by_user_tag
      • get_webentities_mistagged
      • get_webentities_uncrawled
      • get_webentities_page
      • get_webentities_ranking_stats
    • TAGS
      • add_webentity_tag_value
      • add_webentities_tag_value
      • rm_webentity_tag_key
      • rm_webentity_tag_value
      • set_webentity_tag_values
      • get_tags
      • get_tag_namespaces
      • get_tag_categories
      • get_tag_values
    • PAGES, LINKS & NETWORKS
      • get_webentity_pages
      • get_webentity_mostlinked_pages
      • get_webentity_subwebentities
      • get_webentity_parentwebentities
      • get_webentity_nodelinks_network
      • get_webentities_network
    • CREATION RULES
      • get_default_webentity_creationrule
      • get_webentity_creationrules
      • delete_webentity_creationrule
      • add_webentity_creationrule
      • simulate_creationrules_for_urls
      • simulate_creationrules_for_lrus
    • PRECISION EXCEPTIONS
      • get_precision_exceptions
      • delete_precision_exceptions
      • add_precision_exception
    • VARIOUS
      • trigger_links_build
      • trigger_links_reset
      • get_webentities_stats

Default API commands (no namespace)

CORPUS HANDLING

  • test_corpus:
  • corpus (optional, default: "--hyphe--")

Returns the current status of a corpus: "ready"/"starting"/"stopped"/"error".

  • list_corpus:

Returns the list of all existing corpora with their metadata.

  • get_corpus_options:
  • corpus (optional, default: "--hyphe--")

Returns detailed settings of a corpus.

  • set_corpus_options:
  • corpus (optional, default: "--hyphe--")
  • options (optional, default: null)

Updates the settings of a corpus according to the keys/values provided in options as a json object respecting the settings schema visible by querying get_corpus_options. Returns the detailed settings.

  • create_corpus:
  • name (optional, default: "--hyphe--")
  • password (optional, default: "")
  • options (optional, default: {})

Creates a corpus with the chosen name and optional password and options (as a json object see set/get_corpus_options). Returns the corpus generated id and status.

  • start_corpus:
  • corpus (optional, default: "--hyphe--")
  • password (optional, default: "")

Starts an existing corpus possibly password-protected. Returns the new corpus status.

  • stop_corpus:
  • corpus (optional, default: "--hyphe--")

Stops an existing and running corpus. Returns the new corpus status.

  • get_corpus_tlds:
  • corpus (optional, default: "--hyphe--")

Returns the lists of TLDs rules and exceptions built from Mozilla's list at the creation of corpus.

  • backup_corpus:
  • corpus (optional, default: "--hyphe--")

Saves locally on the server in the archive directory a timestamped backup of corpus including 4 json backup files of all webentities/links/crawls and corpus options.

  • ping:
  • corpus (optional, default: null)
  • timeout (optional, default: 3)

Tests during timeout seconds whether an existing corpus is started. Returns "pong" on success or the corpus status otherwise.
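
A typical use of ping is to wait for a corpus to finish starting. The sketch below polls ping until it answers "pong". The `call(method, params)` argument is a hypothetical transport function that you would supply yourself; it is assumed to send the request and return the unwrapped "result" of the answer.

```python
import time


def wait_until_started(call, corpus, attempts=10, pause=1.0, ping_timeout=3):
    """Poll `ping` until it answers "pong"; return the last answer seen.

    `call(method, params)` is a caller-supplied transport function
    (an assumption of this sketch, not part of Hyphe).
    """
    answer = None
    for _ in range(attempts):
        # Positional order follows the ping signature: corpus, timeout.
        answer = call("ping", [corpus, ping_timeout])
        if answer == "pong":
            break
        time.sleep(pause)
    return answer
```

If the returned value is not "pong", it is the last corpus status reported by ping, which can be logged or used to decide whether to call start_corpus.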

  • reinitialize:
  • corpus (optional, default: "--hyphe--")

Completely resets a corpus by cancelling all crawls and emptying the MemoryStructure and Mongo data.

  • destroy_corpus:
  • corpus (optional, default: "--hyphe--")

Resets a corpus then permanently deletes everything associated with it.

  • clear_all:
  • except_corpus_ids (optional, default: [])

Resets Hyphe completely: starts then resets and destroys all existing corpora one by one except for those whose ID is given in except_corpus_ids.

CORE & CORPUS STATUS

  • get_status:
  • corpus (optional, default: "--hyphe--")

Returns global metadata on Hyphe's status and specific information on a corpus.

BASIC PAGE DECLARATION (AND WEBENTITY CREATION)

  • declare_page:
  • url (mandatory)
  • corpus (optional, default: "--hyphe--")

Indexes a url into a corpus. Returns the (newly created or not) associated WebEntity.

  • declare_pages:
  • list_urls (mandatory)
  • corpus (optional, default: "--hyphe--")

Indexes a bunch of urls given as an array in list_urls into a corpus. Returns the (newly created or not) associated WebEntities.

BASIC CRAWL METHODS

  • listjobs:
  • list_ids (optional, default: null)
  • from_ts (optional, default: null)
  • to_ts (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Returns the list and details of all "finished"/"running"/"pending" crawl jobs of a corpus. Optionally returns only the jobs whose id is given in an array of list_ids and/or that were created after timestamp from_ts or before to_ts.

  • propose_webentity_startpages:
  • webentity_id (mandatory)
  • startmode (optional, default: "default")
  • categories (optional, default: false)
  • save_startpages (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Returns a list of suggested startpages to crawl an existing WebEntity defined by its webentity_id using the "default" startmode defined for the corpus or one or an array of either the WebEntity's preset "startpages", "homepage" or "prefixes" or most seen "pages-". Returns them categorised by type of source if categories is set to true. Will save them into the WebEntity if save_startpages is true.

  • crawl_webentity:
  • webentity_id (mandatory)
  • depth (optional, default: 0)
  • phantom_crawl (optional, default: false)
  • status (optional, default: "IN")
  • phantom_timeouts (optional, default: {})
  • corpus (optional, default: "--hyphe--")

Schedules a crawl for a corpus for an existing WebEntity defined by its webentity_id with a specific crawl depth [int]. Optionally use PhantomJS by setting phantom_crawl to "true" and adjust specific phantom_timeouts as a json object with possible keys timeout/ajax_timeout/idle_timeout. Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status ("undecided"/"out"/"discovered"). Will use the WebEntity's startpages if it has any or use otherwise the corpus' "default" startmode heuristic as defined in propose_webentity_startpages (use crawl_webentity_with_startmode to apply a different heuristic, see details in propose_webentity_startpages).

  • crawl_webentity_with_startmode:
  • webentity_id (mandatory)
  • depth (optional, default: 0)
  • phantom_crawl (optional, default: false)
  • status (optional, default: "IN")
  • startmode (optional, default: "default")
  • phantom_timeouts (optional, default: {})
  • corpus (optional, default: "--hyphe--")

Schedules a crawl for a corpus for an existing WebEntity defined by its webentity_id with a specific crawl depth [int]. Optionally use PhantomJS by setting phantom_crawl to "true" and adjust specific phantom_timeouts as a json object with possible keys timeout/ajax_timeout/idle_timeout. Sets simultaneously the WebEntity's status to "IN" or optionally to another valid status ("undecided"/"out"/"discovered"). Optionally define the startmode strategy differently from the corpus "default" one (see details in propose_webentity_startpages).
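
The sketch below assembles the positional arguments for crawl_webentity_with_startmode and checks the keys of phantom_timeouts against the three documented ones. The function name and the example values (webentity id, timeout value) are illustrative assumptions; the unit of the timeouts is not stated on this page.

```python
ALLOWED_TIMEOUT_KEYS = {"timeout", "ajax_timeout", "idle_timeout"}


def crawl_with_startmode_params(webentity_id, depth=0, phantom_crawl=False,
                                status="IN", startmode="default",
                                phantom_timeouts=None, corpus="--hyphe--"):
    """Build the ordered argument array for crawl_webentity_with_startmode."""
    phantom_timeouts = dict(phantom_timeouts or {})
    unexpected = set(phantom_timeouts) - ALLOWED_TIMEOUT_KEYS
    if unexpected:
        raise ValueError("unknown phantom_timeouts key(s): %s"
                         % ", ".join(sorted(unexpected)))
    # Order follows the signature documented above.
    return [webentity_id, depth, phantom_crawl, status,
            startmode, phantom_timeouts, corpus]


# Made-up values: webentity 7, depth 2, two startmodes combined in an array.
params = crawl_with_startmode_params(
    7, depth=2, phantom_crawl=True,
    startmode=["startpages", "homepage"],
    phantom_timeouts={"timeout": 60},
)
```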

  • get_webentity_jobs:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the crawl jobs that have run for a specific WebEntity defined by its webentity_id.

  • get_webentity_logs:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus crawl activity logs on a specific WebEntity defined by its webentity_id.

HTTP LOOKUP METHODS

  • lookup_httpstatus:
  • url (mandatory)
  • timeout (optional, default: 30)
  • corpus (optional, default: "--hyphe--")

Tests a url for timeout seconds using a corpus specific connection (possible proxy for instance). Returns the url's HTTP code.

  • lookup:
  • url (mandatory)
  • timeout (optional, default: 30)
  • corpus (optional, default: "--hyphe--")

Tests a url for timeout seconds using a corpus specific connection (possible proxy for instance). Returns a boolean indicating whether lookup_httpstatus returned HTTP code 200 or a redirection code (301/302/...).
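
The relation between the two lookup methods can be sketched locally as follows. Treating the whole 3xx range as "a redirection code" is an assumption of this sketch; the page only names 301 and 302 explicitly.

```python
def is_reachable(http_status):
    """Mimic the boolean returned by lookup from a lookup_httpstatus code.

    Assumption: any 3xx code counts as a redirection.
    """
    return http_status == 200 or 300 <= http_status < 400
```

For example, a status of 404 or 500 would yield false, while 200, 301 and 302 would yield true.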

Commands for namespace: "crawl."

  • deploy_crawler:
  • corpus (optional, default: "--hyphe--")

Prepares and deploys on the ScrapyD server a spider (crawler) for a corpus.

  • delete_crawler:
  • corpus (optional, default: "--hyphe--")

Removes from the ScrapyD server an existing spider (crawler) for a corpus.

  • cancel_all:
  • corpus (optional, default: "--hyphe--")

Stops all "running" and "pending" crawl jobs for a corpus.

Cancels all current crawl jobs running or planned for a corpus and empties related mongo data.

  • start:
  • webentity_id (mandatory)
  • starts (mandatory)
  • follow_prefixes (mandatory)
  • nofollow_prefixes (mandatory)
  • follow_redirects (optional, default: null)
  • depth (optional, default: 0)
  • phantom_crawl (optional, default: false)
  • phantom_timeouts (optional, default: {})
  • download_delay (optional, default: 1)
  • corpus (optional, default: "--hyphe--")

Starts a crawl for a corpus defining finely the crawl options (mainly for debug purposes):

  • a webentity_id associated with the crawl
  • a list of starts urls to start from
  • a list of follow_prefixes to know which links to follow
  • a list of nofollow_prefixes to know which links to avoid
  • a depth corresponding to the maximum number of clicks done from the start pages
  • phantom_crawl set to "true" to use PhantomJS for this crawl and optional phantom_timeouts as an object with keys among timeout/ajax_timeout/idle_timeout
  • a download_delay corresponding to the time in seconds spent between two requests by the crawler.
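
As a sketch of how the options above line up in a request, the function below builds a JSON-RPC 1.0 body for this method. Two things are assumptions: the method name "crawl.start" is formed by prefixing the namespace to the command name, and the prefix string in the example is only an illustrative LRU, since the LRU syntax is not described on this page.

```python
import json


def crawl_start_request(webentity_id, starts, follow_prefixes,
                        nofollow_prefixes, follow_redirects=None, depth=0,
                        phantom_crawl=False, phantom_timeouts=None,
                        download_delay=1, corpus="--hyphe--", request_id=1):
    """Serialise a call to the start command of the "crawl." namespace."""
    params = [
        webentity_id,
        list(starts),
        list(follow_prefixes),
        list(nofollow_prefixes),
        follow_redirects,
        depth,
        phantom_crawl,
        phantom_timeouts or {},
        download_delay,
        corpus,
    ]
    return json.dumps({"method": "crawl.start", "params": params,
                       "id": request_id})


# Made-up webentity id, start url and prefix.
body = crawl_start_request(
    5,
    starts=["http://example.com/"],
    follow_prefixes=["s:http|h:com|h:example|"],
    nofollow_prefixes=[],
    depth=1,
)
```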
  • cancel:
  • job_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Cancels a crawl of id job_id for a corpus.

  • get_job_logs:
  • job_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus activity logs of a specific crawl with id job_id.

Commands for namespace: "store."

DEFINE WEBENTITIES

  • get_lru_definedprefixes:
  • lru (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all possible LRU prefixes shorter than lru and already attached to WebEntities.

  • declare_webentity_by_lruprefix_as_url:
  • url (mandatory)
  • name (optional, default: null)
  • status (optional, default: null)
  • startPages (optional, default: [])
  • lruVariations (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for the LRU prefix given as a url and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.

  • declare_webentity_by_lru:
  • lru_prefix (mandatory)
  • name (optional, default: null)
  • status (optional, default: null)
  • startPages (optional, default: [])
  • lruVariations (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a lru_prefix and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.

  • declare_webentity_by_lrus_as_urls:
  • list_urls (mandatory)
  • name (optional, default: null)
  • status (optional, default: null)
  • startPages (optional, default: [])
  • lruVariations (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a set of LRU prefixes given as URLs under list_urls and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.

  • declare_webentity_by_lrus:
  • list_lrus (mandatory)
  • name (optional, default: null)
  • status (optional, default: null)
  • startPages (optional, default: [])
  • lruVariations (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Creates for a corpus a WebEntity defined for a set of LRU prefixes given as list_lrus and optionally for the corresponding http/https and www/no-www variations if lruVariations is true. Optionally set the newly created WebEntity's name, status ("in"/"out"/"undecided"/"discovered") and list of startPages. Returns the newly created WebEntity.
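
To make the http/https and www/no-www variations concrete, here is a sketch that enumerates them for a url. It only illustrates the idea behind lruVariations; how Hyphe itself computes the variations (for instance with ports or unusual hosts) is not documented here, and `url_variations` is a hypothetical name.

```python
from urllib.parse import urlsplit, urlunsplit


def url_variations(url):
    """Return the http/https and www/no-www variants of a url, sorted."""
    parts = urlsplit(url)
    host = parts.netloc
    bare = host[4:] if host.startswith("www.") else host
    variations = set()
    for scheme in ("http", "https"):
        for netloc in (bare, "www." + bare):
            variations.add(urlunsplit(
                (scheme, netloc, parts.path, parts.query, parts.fragment)))
    return sorted(variations)
```

Calling `url_variations("http://www.example.com/a")` gives the four combinations of scheme and host for the same path.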

EDIT WEBENTITIES

  • basic_edit_webentity:
  • webentity_id (mandatory)
  • name (optional, default: null)
  • status (optional, default: null)
  • homepage (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Changes for a corpus at once the name, status and homepage of a WebEntity defined by webentity_id.

  • rename_webentity:
  • webentity_id (mandatory)
  • new_name (mandatory)
  • corpus (optional, default: "--hyphe--")

Changes for a corpus the name of a WebEntity defined by webentity_id to new_name.

  • change_webentity_id:
  • webentity_old_id (mandatory)
  • webentity_new_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Changes for a corpus the id of a WebEntity defined by webentity_old_id to webentity_new_id (mainly for advanced debug use).

  • set_webentity_status:
  • webentity_id (mandatory)
  • status (mandatory)
  • corpus (optional, default: "--hyphe--")

Changes for a corpus the status of a WebEntity defined by webentity_id to status (one of "in"/"out"/"undecided"/"discovered").

  • set_webentities_status:
  • webentity_ids (mandatory)
  • status (mandatory)
  • corpus (optional, default: "--hyphe--")

Changes for a corpus the status of a set of WebEntities defined by a list of webentity_ids to status (one of "in"/"out"/"undecided"/"discovered").

  • set_webentity_homepage:
  • webentity_id (mandatory)
  • homepage (optional, default: "")
  • corpus (optional, default: "--hyphe--")

Changes for a corpus the homepage of a WebEntity defined by webentity_id to homepage.

  • add_webentity_lruprefixes:
  • webentity_id (mandatory)
  • lru_prefixes (mandatory)
  • corpus (optional, default: "--hyphe--")

Adds for a corpus a list of lru_prefixes (or a single one) to a WebEntity defined by webentity_id.

  • rm_webentity_lruprefix:
  • webentity_id (mandatory)
  • lru_prefix (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes for a corpus a lru_prefix from the list of prefixes of a WebEntity defined by webentity_id. Will delete the WebEntity if it ends up with no LRU prefix left.

  • add_webentity_startpage:
  • webentity_id (mandatory)
  • startpage_url (mandatory)
  • corpus (optional, default: "--hyphe--")

Adds for a corpus a startpage_url to the list of startpages of a WebEntity defined by webentity_id.

  • rm_webentity_startpage:
  • webentity_id (mandatory)
  • startpage_url (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes for a corpus a startpage_url from the list of startpages of a WebEntity defined by webentity_id.

  • merge_webentity_into_another:
  • old_webentity_id (mandatory)
  • good_webentity_id (mandatory)
  • include_tags (optional, default: false)
  • include_home_and_startpages_as_startpages (optional, default: false)
  • include_name_and_status (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Merges for a corpus two WebEntities by deleting the WebEntity defined by old_webentity_id and adding all of its LRU prefixes to the one defined by good_webentity_id. Optionally set include_tags and/or include_home_and_startpages_as_startpages and/or include_name_and_status to "true" to also add the tags and/or startpages and/or name and status to the resulting merged WebEntity.

  • merge_webentities_into_another:
  • old_webentity_ids (mandatory)
  • good_webentity_id (mandatory)
  • include_tags (optional, default: false)
  • include_home_and_startpages_as_startpages (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Merges for a corpus several WebEntities by deleting the WebEntities defined by a list of old_webentity_ids and adding all of their LRU prefixes to the one defined by good_webentity_id. Optionally set include_tags and/or include_home_and_startpages_as_startpages to "true" to also add the tags and/or startpages to the resulting merged WebEntity.

  • delete_webentity:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes from a corpus a WebEntity defined by webentity_id (mainly for advanced debug use).

RETRIEVE & SEARCH WEBENTITIES

  • get_webentity:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a WebEntity defined by its webentity_id.

  • get_webentity_by_lruprefix:
  • lru_prefix (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity having lru_prefix as one of its LRU prefixes.

  • get_webentity_by_lruprefix_as_url:
  • url (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity having one of its LRU prefixes corresponding to the LRU given under the form of a url.

  • get_webentity_for_url:
  • url (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity to which a url belongs (meaning starting with one of the WebEntity's prefixes and not with a prefix of another WebEntity).

  • get_webentity_for_url_as_lru:
  • lru (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the WebEntity to which a url given under the form of a lru belongs (meaning starting with one of the WebEntity's prefixes and not with a prefix of another WebEntity).
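
One way to read "starting with one of the WebEntity's prefixes and not another" is as a longest-prefix match: when several WebEntities have a prefix matching the LRU, the most specific prefix wins. The sketch below implements that reading locally. This interpretation, the `webentity_for` name and the LRU strings used are all assumptions made for illustration.

```python
def webentity_for(lru, prefixes_by_webentity):
    """Return the id of the WebEntity owning the longest matching prefix.

    `prefixes_by_webentity` maps a WebEntity id to its list of LRU prefixes.
    Returns None when no prefix matches.
    """
    best_id, best_length = None, -1
    for webentity_id, prefixes in prefixes_by_webentity.items():
        for prefix in prefixes:
            if lru.startswith(prefix) and len(prefix) > best_length:
                best_id, best_length = webentity_id, len(prefix)
    return best_id


# Illustrative data: a site and a more specific WebEntity for its blog.
corpus_prefixes = {
    "site": ["s:http|h:com|h:example|"],
    "blog": ["s:http|h:com|h:example|p:blog|"],
}
```

With this data, a page under the blog path is attributed to "blog" even though it also starts with the prefix of "site".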

  • get_webentities:
  • list_ids (optional, default: [])
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • light (optional, default: false)
  • semilight (optional, default: false)
  • light_for_csv (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all existing WebEntities or only the WebEntities whose id is among list_ids. Results are paginated, with count the number of results returned per page and page the number of the desired page of results. Returns all results at once if list_ids is provided or count == -1; otherwise results will include metadata on the request including the total number of results and a token to be reused to collect the other pages via get_webentities_page. Other possible options include:

  • order the results with sort by inputting a field or list of fields as named in the WebEntities returned objects; optionally prefix a sort field with a "-" to reverse the sorting on it; for instance: ["-indegree", "name"] will order by maximum indegree first then by alphabetic order of names
  • set light or semilight or light_for_csv to "true" to collect lighter data with less WebEntities fields.
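
The ordering described for sort can be reproduced locally to check what a given list of fields means. This is only a client-side sketch of the semantics; `sort_webentities` is a hypothetical helper and the sample objects carry just the two fields used in the example above.

```python
def sort_webentities(webentities, sort):
    """Order dicts by a field or list of fields; a "-" prefix reverses one."""
    fields = [sort] if isinstance(sort, str) else list(sort or [])
    ordered = list(webentities)
    # Sort by the least significant field first; Python's sort is stable,
    # so earlier fields in the list end up taking precedence.
    for field in reversed(fields):
        descending = field.startswith("-")
        key = field[1:] if descending else field
        ordered.sort(key=lambda we, k=key: we[k], reverse=descending)
    return ordered


sample = [
    {"name": "b", "indegree": 3},
    {"name": "a", "indegree": 3},
    {"name": "c", "indegree": 10},
]
names = [we["name"] for we in sort_webentities(sample, ["-indegree", "name"])]
```

Here `names` lists the highest indegree first, with ties broken alphabetically.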
  • advanced_search_webentities:
  • allFieldsKeywords (optional, default: [])
  • fieldKeywords (optional, default: [])
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • autoescape_query (optional, default: true)
  • light (optional, default: false)
  • semilight (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities matching a specific search using the allFieldsKeywords and fieldKeywords arguments. Searched keywords will automatically be escaped: set autoescape_query to "false" to allow input of special Lucene queries. Returns all results at once if count == -1; otherwise results are paginated, with count the number of results returned per page and page the number of the desired page of results. Results will include metadata on the request including the total number of results and a token to be reused to collect the other pages via get_webentities_page.

  • allFieldsKeywords should be a string or list of strings to search in all textual fields of the WebEntities ("name"/"status"/"lruset"/"startpages"/...). For instance ["hyphe", "www"]
  • fieldKeywords should be a list of 2-elements arrays giving first the field to search into then the searched value or optionally for the field "indegree" an array of a minimum and maximum values to search into. For instance: [["name", "hyphe"], ["indegree", [3, 1000]]]
  • see description of sort, light and semilight in get_webentities above.
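
The shapes expected for allFieldsKeywords and fieldKeywords can be checked before sending a query. The sketch below builds the ordered argument array for advanced_search_webentities and rejects malformed fieldKeywords entries; the function name and its validation rules are assumptions derived from the descriptions above.

```python
def advanced_search_params(all_fields_keywords=(), field_keywords=(),
                           sort=None, count=100, page=0,
                           autoescape_query=True, light=False,
                           semilight=True, corpus="--hyphe--"):
    """Build the ordered argument array for advanced_search_webentities."""
    if isinstance(all_fields_keywords, str):
        all_fields_keywords = [all_fields_keywords]
    checked = []
    for pair in field_keywords:
        if len(pair) != 2:
            raise ValueError("fieldKeywords entries must have 2 elements")
        field, value = pair
        if isinstance(value, (list, tuple)):
            # Only "indegree" is documented as accepting a [min, max] range.
            if field != "indegree" or len(value) != 2:
                raise ValueError("only indegree accepts a [min, max] array")
            value = list(value)
        checked.append([field, value])
    return [list(all_fields_keywords), checked, sort, count, page,
            autoescape_query, light, semilight, corpus]


params = advanced_search_params(
    ["hyphe", "www"],
    [["name", "hyphe"], ["indegree", [3, 1000]]],
    sort=["-indegree", "name"],
    count=50,
)
```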
  • exact_search_webentities:
  • query (mandatory)
  • field (optional, default: null)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having one textual field or optional specific field exactly equal to the value given as query. Lucene special characters in the query are automatically escaped. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • prefixed_search_webentities:
  • query (mandatory)
  • field (optional, default: null)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having one textual field or optional specific field beginning with the value given as query. Lucene special characters in the query are automatically escaped. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • postfixed_search_webentities:
  • query (mandatory)
  • field (optional, default: null)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having one textual field or optional specific field finishing with the value given as query. Lucene special characters in the query are automatically escaped. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • free_search_webentities:
  • query (mandatory)
  • field (optional, default: null)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having one textual field or optional specific field containing the value given as query. Lucene special characters in the query are automatically escaped. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_by_status:
  • status (mandatory)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having their status equal to status (one of "in"/"out"/"undecided"/"discovered"). Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_by_name:
  • name (mandatory)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having their name equal to name. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_by_tag_value:
  • value (mandatory)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having at least one tag in any namespace/category equal to value. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_by_tag_category:
  • category (mandatory)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having at least one tag in a specific category. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_by_user_tag:
  • category (mandatory)
  • value (mandatory)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities having at least one tag in any category of the namespace "USER" equal to value. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_mistagged:
  • status (optional, default: 'IN')
  • missing_a_category (optional, default: false)
  • multiple_values (optional, default: false)
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all WebEntities of status status with no tag of the namespace "USER" or multiple tags for some USER categories if multiple_values is true or no tag for at least one existing USER category if missing_a_category is true. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_uncrawled:
  • sort (optional, default: null)
  • count (optional, default: 100)
  • page (optional, default: 0)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all IN WebEntities which have no crawl job associated with them. Results are paginated and will include a token to be reused to collect the other pages via get_webentities_page: see advanced_search_webentities for explanations on sort, count and page.

  • get_webentities_page:
  • pagination_token (mandatory)
  • n_page (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the page number n_page of WebEntities corresponding to the results of a previous query run using any of the get_webentities or search_webentities methods using the returned pagination_token.
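
Collecting every page of a query could look like the sketch below. The exact fields of a paginated answer are not listed on this page: the keys "webentities", "token" and "next_page" are assumptions, as are the "store." method prefix and the caller-supplied `call(method, params)` transport function.

```python
def iter_webentities(call, first_page, corpus="--hyphe--"):
    """Yield the WebEntities of a first paginated answer, then follow it.

    Assumed answer shape (not confirmed by this page):
    {"webentities": [...], "token": ..., "next_page": <int or None>}
    """
    page = first_page
    while True:
        for webentity in page["webentities"]:
            yield webentity
        token = page.get("token")
        next_page = page.get("next_page")
        if token is None or next_page is None:
            return
        # Argument order follows the signature: pagination_token, n_page, corpus.
        page = call("store.get_webentities_page", [token, next_page, corpus])
```

The generator stops as soon as an answer carries no token or no next page, so it also works on a query that returned everything at once.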

  • get_webentities_ranking_stats:
  • pagination_token (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus histogram data on the indegrees of all WebEntities matching a previous query run using any of the get_webentities or search_webentities methods using the returned pagination_token.

TAGS

  • add_webentity_tag_value:
  • webentity_id (mandatory)
  • namespace (mandatory)
  • category (mandatory)
  • value (mandatory)
  • corpus (optional, default: "--hyphe--")

Adds for a corpus a tag namespace:category=value to a WebEntity defined by webentity_id.

  • add_webentities_tag_value:
  • webentity_ids (mandatory)
  • namespace (mandatory)
  • category (mandatory)
  • value (mandatory)
  • corpus (optional, default: "--hyphe--")

Adds for a corpus a tag namespace:category=value to a bunch of WebEntities defined by a list of webentity_ids.

  • rm_webentity_tag_key:
  • webentity_id (mandatory)
  • namespace (mandatory)
  • category (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes for a corpus all tags within namespace:category associated with a WebEntity defined by webentity_id if it is set.

  • rm_webentity_tag_value:
  • webentity_id (mandatory)
  • namespace (mandatory)
  • category (mandatory)
  • value (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes for a corpus a tag namespace:category=value associated with a WebEntity defined by webentity_id if it is set.

  • set_webentity_tag_values:
  • webentity_id (mandatory)
  • namespace (mandatory)
  • category (mandatory)
  • values (mandatory)
  • corpus (optional, default: "--hyphe--")

Replaces for a corpus all existing tags of a WebEntity defined by webentity_id for a specific namespace and category by a list of values or a single tag.

  • get_tags:
  • namespace (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a tree of all existing tags of the webentities hierarchised by namespaces and categories. Optionally limits to a specific namespace.

  • get_tag_namespaces:
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing namespaces of the webentities tags.

  • get_tag_categories:
  • namespace (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing categories of the webentities tags. Optionally limits to a specific namespace.

  • get_tag_values:
  • namespace (optional, default: null)
  • category (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a list of all existing values in the webentities tags. Optionally limits to a specific namespace and/or category.

PAGES, LINKS & NETWORKS

  • get_webentity_pages:
  • webentity_id (mandatory)
  • onlyCrawled (optional, default: true)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all indexed Pages fitting within the WebEntity defined by webentity_id. Optionally limits the results to Pages which were actually crawled by setting onlyCrawled to "true" (the default).

  • get_webentity_mostlinked_pages:
  • webentity_id (mandatory)
  • npages (optional, default: 20)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the npages (defaults to 20) most linked Pages indexed that fit within the WebEntity defined by webentity_id.

  • get_webentity_subwebentities:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all sub-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix starting with one of the WebEntity's prefixes).

  • get_webentity_parentwebentities:
  • webentity_id (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all parent-webentities of a WebEntity defined by webentity_id (meaning webentities having at least one LRU prefix that one of the WebEntity's prefixes starts with).
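The sub/parent relation used by these two methods boils down to a string prefix test on LRUs. The sketch below illustrates that test only; it is not Hyphe's implementation. The function names and the sample LRUs are illustrative, and excluding identical prefixes is an assumption.

```python
def is_sub_webentity(candidate_prefixes, entity_prefixes):
    """True if any candidate prefix starts with one of the entity's prefixes.

    Identical prefixes are excluded (assumption: a prefix does not make a
    WebEntity a sub-webentity of itself).
    """
    return any(
        cand != ent and cand.startswith(ent)
        for cand in candidate_prefixes
        for ent in entity_prefixes
    )


def is_parent_webentity(candidate_prefixes, entity_prefixes):
    """The parent relation is the sub relation read the other way round."""
    return is_sub_webentity(entity_prefixes, candidate_prefixes)


# Made-up LRU prefixes for a site and for its blog section.
site = ["s:http|h:com|h:example|"]
blog = ["s:http|h:com|h:example|p:blog|"]

blog_is_sub_of_site = is_sub_webentity(blog, site)
site_is_parent_of_blog = is_parent_webentity(site, blog)
```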

  • get_webentity_nodelinks_network:
  • webentity_id (optional, default: null)
  • include_external_links (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the list of all internal NodeLinks of a WebEntity defined by webentity_id. Optionally adds external NodeLinks (the frontier) by setting include_external_links to "true".

  • get_webentities_network:
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the list of all aggregated weighted links between WebEntities.

CREATION RULES

  • get_default_webentity_creationrule:
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the default WebEntityCreationRule.

  • get_webentity_creationrules:
  • lru_prefix (optional, default: null)
  • corpus (optional, default: "--hyphe--")

Returns for a corpus all existing WebEntityCreationRules or only the one set for a specific lru_prefix.

  • delete_webentity_creationrule:
  • lru_prefix (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes from a corpus an existing WebEntityCreationRule set for a specific lru_prefix.

  • add_webentity_creationrule:
  • lru_prefix (mandatory)
  • regexp (mandatory)
  • apply_to_existing_pages (optional, default: false)
  • corpus (optional, default: "--hyphe--")

Adds to a corpus a new WebEntityCreationRule set for a lru_prefix to a specific regexp or one of "subdomain"/"subdomain-N"/"domain"/"path-N"/"prefix N"/"page", N being an integer. Optionally set apply_to_existing_pages to "true" to apply it immediately to past crawls.
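Because arguments are positional, setting a late optional argument such as corpus means also providing every optional argument listed before it. A small helper can fill in the documented defaults, as in the sketch below. The helper `positional_params`, the sample LRU prefix and the corpus name are illustrative; the parameter order and defaults are the ones listed above for add_webentity_creationrule.

```python
# Parameter order and defaults as listed above for add_webentity_creationrule.
SIGNATURE = [
    ("lru_prefix", None),
    ("regexp", None),
    ("apply_to_existing_pages", False),
    ("corpus", "--hyphe--"),
]
MANDATORY = {"lru_prefix", "regexp"}


def positional_params(signature, mandatory, **kwargs):
    """Order keyword arguments as a positional list, filling in defaults.

    Hypothetical helper, not part of Hyphe.
    """
    known = {name for name, _ in signature}
    unknown = set(kwargs) - known
    if unknown:
        raise TypeError("unknown arguments: %s" % ", ".join(sorted(unknown)))
    missing = mandatory - set(kwargs)
    if missing:
        raise TypeError("missing arguments: %s" % ", ".join(sorted(missing)))
    return [kwargs.get(name, default) for name, default in signature]


# Only corpus is overridden; apply_to_existing_pages falls back to its default.
params = positional_params(
    SIGNATURE,
    MANDATORY,
    lru_prefix="s:http|h:com|h:example|",  # made-up LRU prefix
    regexp="domain",                       # one of the keywords listed above
    corpus="my-corpus",                    # made-up corpus name
)
```

The resulting list can be used directly as the params array of the request.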

  • simulate_creationrules_for_urls:
  • pageURLs (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns an object giving for each URL of pageURLs (single string or array) the prefix of the theoretical WebEntity the URL would be attached to within a corpus following its specific WebEntityCreationRules.
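Reading the answer of this method might look like the sketch below. It assumes the success/fail envelope described in the Data & Query format section and a result object mapping each URL to a prefix; `HypheAPIError`, `unwrap`, `as_list` and the sample answer are illustrative names and data, not part of Hyphe.

```python
class HypheAPIError(Exception):
    """Raised when the API answers with code "fail" (hypothetical class)."""


def unwrap(response):
    """Return the "result" of a successful answer, raise on failure."""
    if response.get("code") == "success":
        return response["result"]
    raise HypheAPIError(response.get("message", "unknown error"))


def as_list(urls):
    """pageURLs accepts a single string or an array: normalize to a list."""
    return [urls] if isinstance(urls, str) else list(urls)


# Made-up answer for a single URL, shaped as URL -> prefix.
response = {
    "code": "success",
    "result": {"http://example.com/blog/post": "s:http|h:com|h:example|"},
}

prefixes = unwrap(response)
for url in as_list("http://example.com/blog/post"):
    print(url, "->", prefixes[url])
```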

  • simulate_creationrules_for_lrus:
  • pageLRUs (mandatory)
  • corpus (optional, default: "--hyphe--")

Returns an object giving for each LRU of pageLRUs (single string or array) the prefix of the theoretical WebEntity the LRU would be attached to within a corpus following its specific WebEntityCreationRules.

PRECISION EXCEPTIONS

  • get_precision_exceptions:
  • corpus (optional, default: "--hyphe--")

Returns for a corpus the list of all existing PrecisionExceptions.

  • delete_precision_exceptions:
  • list_lru_exceptions (mandatory)
  • corpus (optional, default: "--hyphe--")

Removes from a corpus a set of existing PrecisionExceptions listed as list_lru_exceptions.

  • add_precision_exception:
  • lru_prefix (mandatory)
  • corpus (optional, default: "--hyphe--")

Adds to a corpus a new PrecisionException for lru_prefix.

VARIOUS

  • trigger_links_build:
  • corpus (optional, default: "--hyphe--")

Will initiate a links calculation update (useful especially when a corpus crashed during the links calculation and no more crawls are scheduled).

  • trigger_links_reset:
  • corpus (optional, default: "--hyphe--")

Will initiate a whole reset and regeneration of all WebEntityLinks of a corpus. Can take a while.

  • get_webentities_stats:
  • corpus (optional, default: "--hyphe--")

Returns for a corpus a set of statistics on the distribution of WebEntities by status, recorded every 5 minutes.