Hexis API is a RESTful API that provides access to language analysis tools built at our lab. At the moment this includes a high-performance text classification model for offensive language detection.
Hexis API is designed to be straightforward to use. You can interact with the API from the programming language of your choice. For testing, a tool like Postman, Insomnia, or cURL on the command line can be useful.
This API uses a base path of https://api.hexis.ai. It is only available via an SSL-secured HTTP connection.
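Requests against the base path can be built with standard library tools. A minimal sketch in Python follows; note that the endpoint path `/classify` and the request field names are assumptions for illustration, not part of this documentation — consult the OpenAPI specification for the actual contract.

```python
import json
import urllib.request

def build_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request against the Hexis API base path."""
    payload = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        "https://api.hexis.ai/classify",  # hypothetical endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_request("example message")
print(req.full_url)      # https://api.hexis.ai/classify
print(req.get_method())  # POST
```

Sending the request (e.g. via `urllib.request.urlopen`) is left out here, since authentication details depend on your account setup.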
To work with Hexis API, you can either use standard network libraries for the programming language of your choice, or use the Swagger Generator 3 tool to auto-generate client libraries for a variety of languages. However, we do not guarantee that these auto-generated libraries are compatible with Hexis API out of the box. To get started with client auto-generation, please find our OpenAPI 3.0 specification here.
The HTTP content type is application/json, and the payloads exchanged between a client and the API endpoint are valid JSON objects. Things to note in the text content: double quotes (") need to be escaped (\") and line breaks should be replaced with \n.
The score is an indication of probability, not severity. Higher numbers represent a higher likelihood that the patterns in the text resemble patterns in comments that people have tagged as offensive. Scores are intended to let developers pick thresholds and automatically accept, review, or reject text messages based on those thresholds. Although the numbers are not a score of how offensive a particular comment is, the threshold can be set according to the use case at hand.
For a high-recall use case, where false positives are preferable to false negatives, one can choose a threshold around 0.5 or above. While a message that is only mildly toxic may be mistakenly flagged (ideally there is a manual review process in place), this setting will catch the less salient, implicit cases.
On the other hand, there are high-precision use cases where the priority is to automatically filter only definitive cases of offensive language. Here it's safe to choose a point around 0.9 or above.
As a simple heuristic: with manual moderation in place, a sensible threshold value is 0.5. Alternatively, the raw classification scores can be used to rank items for review. It is also possible to operate the system without manual moderation, using a threshold value of 0.9 (potentially overblocking in a few cases) or 0.99 (potentially underblocking in a few cases).
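The threshold logic above can be sketched as a small decision function. The function name and the accept/review/reject labels are illustrative, not part of the API:

```python
# Map a classifier score to a moderation action using the thresholds
# discussed above: accept below the review threshold, queue for manual
# review in between, and reject automatically above the reject threshold.
def decide(score: float, review_at: float = 0.5, reject_at: float = 0.9) -> str:
    if score >= reject_at:
        return "reject"
    if score >= review_at:
        return "review"
    return "accept"

print(decide(0.3))   # accept
print(decide(0.7))   # review
print(decide(0.95))  # reject
```

For a setup without manual moderation, pass the same value (e.g. 0.9 or 0.99) for both thresholds so that everything is either accepted or rejected.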
Currently, the API allows a maximum of 10 requests per second by default. Once this limit is reached, it returns errors with an HTTP status code of 429. The maximum request body size is 10 kilobytes; larger requests trigger a status code of 413. Please note that the maximum input size for the classification model is 120 words. Any input larger than this will be split into 120-word parts, each counted separately. If service for a given account is temporarily suspended (e.g. no trial credits left and no payment information on file), a status code of 401 is issued.
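To anticipate how an oversized input will be counted, a client can split the text into 120-word parts itself. This is an approximation, assuming whitespace-separated words; the service's exact word counting is not documented here:

```python
MAX_WORDS = 120  # model input limit stated above

def split_input(text: str, max_words: int = MAX_WORDS) -> list[str]:
    """Split a text into parts of at most max_words whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

parts = split_input("word " * 250)
print(len(parts))             # 3 parts: 120 + 120 + 10 words
print(len(parts[0].split()))  # 120
print(len(parts[-1].split())) # 10
```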
In practice, one often does not need to think about the exact timing in between requests, as small bursts of excessive requests are still being processed by the endpoint. The exception to this is bulk processing, where a 100ms pause in between requests should be implemented.
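A bulk-processing loop with the recommended 100 ms pause can be sketched as follows. The `send_fn` parameter is a placeholder for whatever function actually performs the HTTP request:

```python
import time

def process_bulk(items, send_fn, pause_s: float = 0.1):
    """Call send_fn on each item, pausing pause_s seconds between requests."""
    results = []
    for item in items:
        results.append(send_fn(item))
        time.sleep(pause_s)  # stay well under the 10 requests/second limit
    return results

# Usage with a stub instead of a real network call:
seen = []
process_bulk(["a", "b", "c"], seen.append, pause_s=0.0)
print(seen)  # ['a', 'b', 'c']
```

A production client would additionally retry on status code 429 with a backoff, but that is beyond this sketch.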
In case you want to build a custom model with us, this is the specification for the training data.
The annotation scheme largely follows the guidelines of the GermEval Shared Task on the Identification of Offensive Language, as described e.g. in the Overview of the GermEval 2018 Shared Task on the Identification of Offensive Language. The primary task is Task 1: Coarse-grained Binary Classification, where postings are labeled as either OFFENSE or OTHER. Optionally, it is possible to also annotate for Task 2: Fine-grained 4-way Classification, where postings are labeled as PROFANITY, INSULT, ABUSE, or OTHER. We extend the second task by an additional class.
The data should be placed in a UTF-8 encoded plain text file. One training example per line, in the following format:
<TEXT> tab <LABEL-TASK-I> tab <LABEL-TASK-II>
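A helper for writing one training example in this format might look like the following. The label values shown in the usage example follow the GermEval scheme referenced above; treating the Task 2 label as optional is an assumption of this sketch:

```python
def make_line(text: str, label_task1: str, label_task2: str = "") -> str:
    """Serialize one training example as a tab-separated line."""
    text = text.replace("\n", "\\n")  # line breaks must not split the record
    fields = [text, label_task1]
    if label_task2:
        fields.append(label_task2)
    return "\t".join(fields)

line = make_line('Example "posting"\nwith a break', "OFFENSE", "INSULT")
print(line)
print(line.count("\t"))  # 2 tabs separating 3 fields
```

The resulting lines can be written to a UTF-8 encoded file, one example per line, as specified above.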
Line breaks in the text should be replaced with \n.
Language code. Possible values are
API version. Possible values are
List of scores
413 Request Entity Too Large
429 Too Many Requests
- "text": "string"
- "scores": [