webLyzard Asset Upload API Specification 21 June 2025

Introduction

The Asset Upload API describes the programmatic interface for creating, retrieving, updating and deleting (CRUD) assets/documents in a given document repository. To use the API, such a repository needs to be configured first. Assets/documents in a repository are identified by a unique numeric <identifier> that the platform generates when a new document is added. Subsequent document retrievals, updates and deletions should refer to this identifier.

In addition to this repository-only functionality, the Asset Upload API also supports annotating existing documents using any of the supported webLyzard annotation tools. For document annotation no repository is required.

The Asset Upload API provides the following functionality:

Adding a document to a given repository
Updating an existing document in a given repository
Deleting an existing document from a given repository
Querying an existing document from a given repository
Annotating an existing document without a given repository

Usage of the Asset Upload API requires an access token. For further information on how to obtain an access token, please refer to Appendix A: Authentication | Authorization.

Adding Assets (Create)

Assets are always added to a specific repository, and the data format has to adhere to the webLyzard Document specification. Adding a document will always result in

the creation of a new numeric identifier and therefore a new document in the repository, regardless of whether the URI already exists, and
the execution of all annotation steps as defined for the repository - an additional call to the Annotate API is therefore not required.

As parameters, the Asset Upload API expects content to be provided as either:

tokenized content, where
- sentences are provided
- no content is provided
JSON string content (text/html or text/plain), where
- content and content_type are provided
- no sentence information is provided (free text or unstructured HTML)

If both plain text and tokenized content are provided, or if neither is provided, a processing error (4xx) will be returned. For further information on valid document structures to be sent to the API, please refer to Appendix B: webLyzard Document Format.
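The either/or rule above can be checked on the client before a request is sent. The following Python sketch is an illustration only: the function name check_content_mode is hypothetical, treating an empty sentences list as "not provided" is an assumption, and the server remains the authority on what it accepts.

```python
def check_content_mode(document: dict) -> str:
    """Return 'tokenized' or 'plain' for a document, or raise ValueError.

    Mirrors the rule in the text: exactly one of `sentences` or
    `content` (together with `content_type`) has to be present.
    """
    # Assumption: an empty or missing `sentences` list counts as not provided.
    has_sentences = bool(document.get("sentences"))
    has_content = document.get("content") is not None

    if has_sentences and has_content:
        raise ValueError("provide either 'sentences' or 'content', not both")
    if not has_sentences and not has_content:
        raise ValueError("provide either 'sentences' or 'content'")
    if has_content:
        if document.get("content_type") not in ("text/html", "text/plain"):
            raise ValueError("'content' requires content_type text/html or text/plain")
        return "plain"
    return "tokenized"
```

Running such a check locally avoids a round trip that would only end in a 4xx response.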

To create a new document, send a POST request to the <repository> API endpoint, with the body of the request containing the document.

{
    "repository_id": "repository",
    "title": "document title",
    "uri": "the document's uri",
    "content": "Therefore we could show that \"x>y\" and \"y<z.\".",
    "content_type": "text/plain"
}

Listing 1: A minimal valid JSON document

curl -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" -d @document.json -XPOST https://api.weblyzard.com/1.0/documents/<repository>

Listing 2: Adding an asset to a webLyzard repository via the Asset Upload API

If the asset has been successfully stored, the server responds with a “201 Created” status code and the “Location” header field contains the unique <identifier> created for this asset.

HTTP/1.1 201 Created
Location: https://api.weblyzard.com/1.0/documents/<repository>/<identifier>
Content-Type: application/json; charset=UTF-8
Content-Length: ...

{"created":true, "_id":"<identifier>"}

Listing 3: REST response from the Asset Upload API (asset add)

In case of error, the server will return one of multiple error codes and a description of the error. A 4xx error will be returned in case the request is malformed (e.g. “400 Bad Request”) or the user does not have the appropriate access rights (e.g. “403 Forbidden”). A 5xx error will be returned if processing on the server failed.
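For clients that do not shell out to curl, the create call can be prepared with the Python standard library. This is a minimal sketch under stated assumptions: the helper names build_create_request and identifier_from_location are hypothetical, and the identifier is assumed to be the last path segment of the Location header, as Listing 3 suggests.

```python
import json
import urllib.request
from urllib.parse import urlparse

BASE_URL = "https://api.weblyzard.com/1.0"  # base URL as used in the listings


def build_create_request(repository: str, document: dict, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the POST request that adds a document."""
    return urllib.request.Request(
        url=f"{BASE_URL}/documents/{repository}",
        data=json.dumps(document).encode("utf-8"),  # the API expects UTF-8
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def identifier_from_location(location: str) -> str:
    """Return the last path segment of a Location header value."""
    return urlparse(location).path.rstrip("/").rsplit("/", 1)[-1]
```

The request can then be sent with urllib.request.urlopen, which raises urllib.error.HTTPError for the 4xx and 5xx responses described above; on success the Location header of the response can be passed to identifier_from_location.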

Retrieving Assets

The most recent version of an asset can be retrieved by sending a GET request to the <identifier> of the asset:

curl -H "Authorization: Bearer <access_token>" -XGET 'https://api.weblyzard.com/1.0/documents/<repository>/<identifier>'

Listing 4: Retrieving an asset from a webLyzard repository via the Asset Upload API

The server responds with a “200 OK” status code and the JSON representation of the document (as specified by the webLyzard document specification):

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: …

{"_id":"<identifier>", "repository":"<repository>", "title":...}

Listing 5: REST response from the Asset Upload API (asset retrieval)

If the asset does not exist, the server returns a “404 Not Found” response.
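A client will usually want to treat the 404 case as "no such asset" rather than as a failure. The sketch below shows one way to do that with the Python standard library; the function name fetch_document and its opener parameter (which allows the network call to be replaced in tests) are hypothetical and not part of the API.

```python
import json
import urllib.error
import urllib.request

BASE_URL = "https://api.weblyzard.com/1.0"  # base URL as used in the listings


def fetch_document(repository, identifier, token, opener=urllib.request.urlopen):
    """Return the stored document as a dict, or None if it does not exist."""
    request = urllib.request.Request(
        f"{BASE_URL}/documents/{repository}/{identifier}",
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )
    try:
        with opener(request) as response:
            return json.load(response)
    except urllib.error.HTTPError as error:
        if error.code == 404:
            return None  # the asset does not exist
        raise  # other 4xx/5xx errors are left to the caller
```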

Updating Assets

Assets can be updated (overwritten) with a newer version of the same asset.

curl -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" -d @document.json -XPUT https://api.weblyzard.com/1.0/documents/<repository>/<identifier>

Listing 6: Updating an asset in a webLyzard repository via the Asset Upload API

For further information on valid document structures to be sent to the Asset Upload API, please refer to Appendix B: webLyzard Document Format. On success, the server responds with a “200 OK” status code:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: …

{"created":false,"updated":true,"_id":"<identifier>"}

Listing 7: REST response from the Asset Upload API (asset update)

If no asset is referenced by <identifier>, the server responds with a “400 Bad Request” error. In that case the document should be added using the syntax for adding assets: the <identifier> is always created by the server and cannot be set arbitrarily by the client.
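The update-then-create fallback described above could be sketched as follows. This is an illustration, not a prescribed client: save_document and the injected send callable (which performs the HTTP call and returns the status code and decoded JSON body) are hypothetical names. Note that a 400 response can also indicate a malformed request, so a production client should inspect the error description before falling back to a create.

```python
from typing import Callable, Optional, Tuple

# (method, path, document) -> (HTTP status code, decoded JSON body)
Send = Callable[[str, str, dict], Tuple[int, dict]]


def save_document(send: Send, repository: str, document: dict,
                  identifier: Optional[str] = None) -> str:
    """Update the document if an identifier is known, otherwise create it.

    Returns the identifier under which the server stored the document.
    """
    if identifier is not None:
        status, body = send("PUT", f"/documents/{repository}/{identifier}", document)
        if status == 200:
            return body["_id"]
        if status != 400:
            raise RuntimeError(f"update failed with status {status}")
        # 400: per the text, no asset is referenced by this identifier,
        # so fall through and let the server assign a new one.
    status, body = send("POST", f"/documents/{repository}", document)
    if status != 201:
        raise RuntimeError(f"create failed with status {status}")
    return body["_id"]
```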

Deleting Assets

Assets can be deleted by issuing a DELETE request on the identifier of the asset:

curl -H "Authorization: Bearer <access_token>" -XDELETE 'https://api.weblyzard.com/1.0/documents/<repository>/<identifier>'

Listing 8: Deleting a document from a webLyzard repository via the Asset Upload API

On success, the server responds with a “200 OK” status code:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: …

{"deleted":true,"_id":"<identifier>"}

Listing 9: REST response from the Asset Upload API (asset delete)

Annotating Assets

Instead of pushing assets into a webLyzard repository, users may also run asset annotations without permanently storing the assets in a repository.

The Asset Upload API currently supports the following document annotations:

sentiment, extracts document and sentence level polarity from a document (also runs sentence tokenization and POS tagging as a prerequisite, if not provided by the user).
namedentities, identifies named entities in a document using our Named Entity Recognition (NER) tool, Recognyze.
pos, extracts part-of-speech and sentence information from free text.
keywords, extracts the top 10 keywords from free text.
summary, extracts a summary consisting of the 3 most significant sentences from free text.

For information on valid document structures to be sent to the annotation service, please refer to Appendix B: webLyzard Document Format.

curl -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" -d @document.json -XPOST https://api.weblyzard.com/1.0/annotate

Listing 10: Annotating an asset via the Asset Upload API

If successful, the server responds with a “200 OK” response code and returns the annotated document:

HTTP/1.1 200 OK
Content-Type: application/json; charset=UTF-8
Content-Length: …

{...}

Listing 11: REST response from the Annotate API (document annotate)

To add specific annotations only, the annotation type can be included in the request:

curl -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" -d @document.json -XPOST https://api.weblyzard.com/1.0/annotate/sentiment

curl -H "Authorization: Bearer <access_token>" -H "Content-Type: application/json" -d @document.json -XPOST https://api.weblyzard.com/1.0/annotate/sentiment+namedentities

Listing 12: Annotating a document via the Annotate API with different workflows
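The annotation endpoint URL can be assembled programmatically. The sketch below assumes that any combination of the annotation names listed in this section can be joined with "+", as the sentiment+namedentities listing suggests; only that combination is confirmed by this document. The function name annotate_url is hypothetical.

```python
# Annotation names as listed in this section.
SUPPORTED_ANNOTATIONS = ("sentiment", "namedentities", "pos", "keywords", "summary")


def annotate_url(*annotations: str, base_url: str = "https://api.weblyzard.com/1.0") -> str:
    """Build the annotate endpoint URL for the requested annotation types.

    Without arguments the plain /annotate endpoint is returned.
    """
    unknown = [name for name in annotations if name not in SUPPORTED_ANNOTATIONS]
    if unknown:
        raise ValueError(f"unsupported annotation type(s): {', '.join(unknown)}")
    if not annotations:
        return f"{base_url}/annotate"
    # Assumption: several annotation types are combined with '+'.
    return f"{base_url}/annotate/" + "+".join(annotations)
```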

Appendix A: Authentication | Authorization

Authentication and authorization are handled using JSON Web Tokens (JWT). Until tokens are issued by the global webLyzard login server, new tokens can be obtained from the /token API endpoint using Basic Authentication.

A token is valid for 8 hours, after which the token will be rejected by the API and a new token must be generated.
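Because a token expires after 8 hours, long-running clients may want to cache it and request a new one shortly before it expires. The following sketch shows one possible approach; the TokenCache class, the injected fetch_token callable and the 60-second safety margin are all assumptions made for illustration.

```python
import time

TOKEN_LIFETIME_SECONDS = 8 * 60 * 60  # validity period stated in the text


class TokenCache:
    """Cache a token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, clock=time.monotonic, safety_margin=60.0):
        self._fetch_token = fetch_token      # callable returning a fresh token
        self._clock = clock                  # injectable for testing
        self._safety_margin = safety_margin  # hypothetical margin in seconds
        self._token = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            self._expires_at = now + TOKEN_LIFETIME_SECONDS - self._safety_margin
        return self._token
```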

Obtaining a New Token

To obtain a new token, send a GET request to the /token endpoint:

curl -i -u <user>:<pass> https://api.weblyzard.com/1.0/token

The server responds with the issued token for the user:

HTTP/1.1 200 OK
Date: Tue, 17 Nov 2015 12:43:10 GMT
Server: Apache/2.4.7 (Ubuntu)
Content-Length: 626
Connection: close

eyJ0eXAiOiJKV2QiLCJhbGciOiJIUzUxMiJ9.ey4wZXJtaX...
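
The curl call above relies on the -u option to send HTTP Basic Authentication credentials. A rough equivalent using the Python standard library is sketched below; the function name build_token_request is hypothetical, and the response body is assumed to contain only the token, as in the example response.

```python
import base64
import urllib.request


def build_token_request(user: str, password: str,
                        base_url: str = "https://api.weblyzard.com/1.0") -> urllib.request.Request:
    """Prepare (but do not send) the GET request for a new token."""
    # HTTP Basic Authentication: base64-encoded "user:password".
    credentials = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        f"{base_url}/token",
        headers={"Authorization": f"Basic {credentials}"},
        method="GET",
    )
```

Sending the request with urllib.request.urlopen and decoding the response body would then yield the token string.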

Calling API Methods Using the Obtained Token

All API calls must be authenticated using a valid token (see above). Pass the token using the “Authorization: Bearer” request header:

curl -i -H "Authorization: Bearer eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzUxMiJ9.eyJwZXJtaX..." https://api.weblyzard.com/1.0/documents/test/12345

Appendix B: webLyzard Document Format

The webLyzard document format consists of three related object structures: the Document, the Sentence, and the Annotation. These three data structures strongly depend on each other, but only Document is essential.

The Document structure models a single document to be uploaded. It provides the system with the basic information required to process this single document. Fully tokenized and annotated documents can be provided via the Annotation (surface form annotations such as Named Entity and target sentiment) and the Sentence (tokenization, part-of-speech, sentiment) formats as documented below.

Accepted document encoding is limited to UTF-8.

Document
Required Fields
uri	string the unique identifier of the document, e.g. a URL. Must be a valid URI.
title	string a title string
Optional Fields
repository_id	string a unique ID consisting of a repository name + fqdn of the content provider, source or project - for example: mobile.asap-fp7.eu, media.ecoresearch.net, or social.weblyzard.com.
language_id	string ISO language identifier. Supported values are ‘en’, ‘es’, ‘fr’ and ‘de’.
content_type	string only required if content is provided; specifies how the content should be interpreted. Supported values are text/html and text/plain.
content	variable the document content as a JSON string, with content_type specifying the respective content format. If content is provided, then sentences must not be provided. Providing both content and sentences will result in an error. Providing content without content_type will result in an error.
sentences	list an ordered list of tokenized sentence objects. If sentences are not provided, content and content_type must be provided.
annotations	list a list of webLyzard annotation objects. The visualization components currently support the sentiment and named_entity types, but the list may also be used for other annotation types (such as rumours, opinion targets, etc.).
meta_data	dict a dictionary describing arbitrary document metadata that provides additional document-level information, for example:
- author: the author of the document
- published_date: a date string determining when a document was published (and therefore when it will be visible in the portal). If no published_date is provided, we will try to extract one from the content. If this fails, the submission date of the document is used as the published_date.
- polarity: document-level sentiment polarity
- document_linkage: 1) ‘refers-to’, n-to-n, for retweets, quotes, etc.; 2) ‘child-of’, 1-to-1, to model nested conversations (threaded dialogues); 3) ‘part-of’, n-to-n, document belongs to e.g. a story cluster or other collection
features	dict A dictionary describing arbitrary data as key value-pairs, which complements the more well-defined meta_data field. These key value-pairs are disregarded in the visual analytics dashboard, unless custom frontend functions were developed to process them.
relations	dict A dictionary describing arbitrary document-to-document relations as key-value pairs. Document relations do not have any impact in the portal unless explicitly requested.
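
A client may want to run a rough pre-flight check against the field descriptions above before uploading. The sketch below is illustrative only: document_problems is a hypothetical helper, and its URI test (a scheme is present and the string contains no whitespace) is a deliberately loose approximation of "valid URI", not the validation performed by the server.

```python
from urllib.parse import urlparse

REQUIRED_FIELDS = ("uri", "title")
SUPPORTED_LANGUAGES = ("en", "es", "fr", "de")


def document_problems(document: dict) -> list:
    """Return a list of human-readable problems found in a document dict."""
    problems = [f"missing required field: {name}"
                for name in REQUIRED_FIELDS if not document.get(name)]

    uri = document.get("uri", "")
    if uri:
        # Loose approximation: a scheme must be present, whitespace is not allowed.
        if not urlparse(uri).scheme or any(ch.isspace() for ch in uri):
            problems.append("uri does not look like a valid URI")

    language = document.get("language_id")
    if language is not None and language not in SUPPORTED_LANGUAGES:
        problems.append(f"unsupported language_id: {language}")
    return problems
```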

Document Examples

{
    "repository_id": "media.ecoresearch.net",
    "uri": "http://www.bbc.com/news/science-environment-33524589",
    "content_type": "text/plain",
    "title": "New Horizons: Nasa spacecraft speeds past Pluto",
    "content": " Nasa’s spacecraft speeds past Pluto",
    "meta_data": {
        "author": "Jonathan Amos"
    }
}

{
    "repository_id": "media.ecoresearch.net",
    "uri": "http://www.bbc.com/news/science-environment-33524589",
    "title": "New Horizons: Nasa’s spacecraft speeds past Pluto",
    "sentences": [
        {
            "id": "595f44fec1e92a71d3e9e77456ba80d1",
            "value": "New Horizons: Nasa’s spacecraft speeds past Pluto",
            "is_title": true,
            "pos_list": "NNP NNP : NNP POS NN NNS IN NNP",
            "tok_list": "0,3 4,12 12,13 14,18 18,20 21,31 32,38 39,43 44,49",
            "sentence_number": 0,
            "polarity": -0.783
        }
    ],
    "annotations": [
        {
            "start": 12,
            "end": 16,
            "sentence": "595f44fec1e92a71d3e9e77456ba80d1",
            "surface_form": "Nasas",
            "annotation_type": "OrganizationEntity",
            "key": "http://dbpedia.org/page/Nasa"
        },
        {
            "start": 40,
            "end": 44,
            "sentence": "595f44fec1e92a71d3e9e77456ba80d1",
            "surface_form": "Pluto",
            "key": "http://dbpedia.org/page/Pluto",
            "annotation_type": "GeoEntity"
        }
    ],
    "meta_data": {
        "polarity": "0.342",
        "published_date": "2015-07-14"
    }
}

Sentence
Required Fields
id	string the unique identifier of the sentence, i.e. a sentence hash (md5)
value	string the sentence text
pos_list	string a whitespace separated list of part-of-speech tags (POS), one per token. Currently supported POS by language are specified at http://weblyzard-api.readthedocs.org/en/latest/weblyzard_api.data_format.pos-tags.html
tok_list	string a whitespace separated list of sentence tokens (words), encoded as space-separated string of comma-separated sentence offset tuples: start_offset,end_offset, e.g. “0,2 3,19”
Optional Fields
is_title	boolean is the sentence part of the title, defaults to False. If the document-level attribute “title” is also set, the value of this sentence must match that attribute.
dep_tree	string a whitespace separated list of pointers to the parent of a token in the dependency tree. -1 denotes the root node.
sentence_number	int 0-based sentence sequence number, i.e. the index of a sentence in the list of all document sentences
paragraph_number	int 0-based paragraph sequence number, i.e. the index of a paragraph in the list of all document paragraphs
polarity	float sentence-level sentiment polarity as floating point in range [-1..1]
polarity_class	string sentence-level sentiment polarity class, with possible values [‘positive’, ‘negative’, ‘neutral’]

Sentence Examples

{
    "id": "595f44fec1e92a71d3e9e77456ba80d1",
    "value": "New Horizons: Nasa’s spacecraft speeds past Pluto.",
    "is_title": false,
    "pos_list": "NNP NNP : NNP POS NN NNS IN NNP .",
    "tok_list": "0,3 4,12 12,13 14,18 18,20 21,31 32,38 39,43 44,49 49,50",
    "sentence_number": 0,
    "polarity": -0.783,
    "polarity_class": "negative"
}

{
    "id": "595f44fec1e92a71d3e9e77456ba80d1",
    "value": "New Horizons: Nasa’s spacecraft speeds past Pluto."
}
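
The relationship between value, tok_list and pos_list can be made concrete with a few lines of Python. This sketch assumes that tok_list offsets are character offsets into value with an exclusive end offset, which is consistent with the first sentence example above; the function name tagged_tokens is hypothetical.

```python
def tagged_tokens(sentence: dict) -> list:
    """Return (token, POS tag) pairs for a sentence object.

    Assumes `tok_list` holds "start,end" character offsets into `value`
    (end exclusive) and `pos_list` holds one tag per token.
    """
    value = sentence["value"]
    spans = [tuple(int(offset) for offset in span.split(","))
             for span in sentence["tok_list"].split()]
    tags = sentence["pos_list"].split()
    if len(spans) != len(tags):
        raise ValueError("pos_list and tok_list differ in length")
    return [(value[start:end], tag) for (start, end), tag in zip(spans, tags)]
```

Applied to the first sentence example, this yields ten pairs, starting with ("New", "NNP") and ending with (".", ".").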

Annotation
Required Fields
start	int the start offset of the annotation, relative to the absolute document content (not tokenized).
end	int the end offset of the annotation, relative to the absolute document content (not tokenized).
surface_form	string the surface form of the annotation (e.g. how the annotation actually appears in the document)
annotation_type	string the type of the annotation. Supported by the visualization components are Sentiment, GeoEntity, PersonEntity, OrganizationEntity. Arbitrary other sentence-level annotations are allowed, but not currently supported by the visualization components.
Optional Fields
key	string reference key, e.g. Linked Open Data (LOD)
sentence_id	string the id of the sentence object to which the annotation’s start and end positions refer. If no sentence id is specified, the annotation positions are applied on the document level.
display_name	string searchable field in the portal
polarity	float sentence-level sentiment polarity as floating point in range [-1..1]
polarity_class	string document-level sentiment polarity, with possible values [‘positive’, ‘negative’, ‘neutral’]. Requires annotation_type to be sentiment.
properties	dict a dictionary of additional properties associated with the annotation. The expected key-value pairs in the properties depend on the type of the entity defined on the webLyzard document level (e.g. the key to the annotation list). Properties supported by the portal are: lat, long, population, birth_date, abstract

Annotation Examples

{
    "start": 87,
    "end": 92,
    "sentence": "595f44fec1e92a71d3e9e77456ba80d1",
    "surface_form": "Apple",
    "key": "http://dbpedia.org/page/Apple_Inc",
    "annotation_type": "OrganizationEntity",
    "display_name": "Apple Incorporated",
    "properties": {
        "founders": "Steve Jobs,Steve Wozniak,Ronald Wayne"
    }
}

{
    "start": 12,
    "end": 14,
    "sentence": "595f44fec1e92a71d3e9e77456ba80d1",
    "surface_form": "USA",
    "key": "http://dbpedia.org/page/United_States",
    "annotation_type": "GeoEntity",
    "display_name": "U.S.A",
    "properties": {
        "population": "318.900.000",
        "lat": "100.0",
        "long": "30.0"
    }
}

{
    "start": 12,
    "end": 14,
    "sentence": "595f44fec1e92a71d3e9e77456ba80d1",
    "surface_form": "USA",
    "polarity": 0.655,
    "annotation_type": "Sentiment"
}

{
    "start": 12,
    "end": 14,
    "surface_form": "USA",
    "annotation_type": "GeoEntity"
}