FB4D Reference IVisionML
This interface provides two functions for image analysis based on the Cloud Vision API. While the method AnnotateFile expects the image file in TIFF or GIF format in Base64 notation as its first parameter, the method AnnotateStorage expects a reference to an image file in the same formats stored in the storage. In addition to image analysis, both methods are also able to analyze documents in PDF format.
IVisionML = interface(IInterface)
function AnnotateFileSynchronous(const FileAsBase64,
ContentType: string; Features: TVisionMLFeatures;
MaxResultsPerFeature: integer = 50;
Model: TVisionModel = vmStable): IVisionMLResponse;
procedure AnnotateFile(const FileAsBase64,
ContentType: string; Features: TVisionMLFeatures;
OnAnnotate: TOnAnnotate; OnAnnotateError: TOnRequestError;
const RequestID: string = ''; MaxResultsPerFeature: integer = 50;
Model: TVisionModel = vmStable);
function AnnotateStorageSynchronous(const RefStorageCloudURI,
ContentType: string; Features: TVisionMLFeatures;
MaxResultsPerFeature: integer = 50;
Model: TVisionModel = vmStable): IVisionMLResponse;
procedure AnnotateStorage(const RefStorageCloudURI,
ContentType: string; Features: TVisionMLFeatures;
OnAnnotate: TOnAnnotate; OnAnnotateError: TOnRequestError;
MaxResultsPerFeature: integer = 50;
Model: TVisionModel = vmStable);
end;
In the parameter ContentType, the file type is passed as a standardized HTTP content type ('image/gif', 'image/tiff', or 'application/pdf').
The parameter Features contains a set of the following features to process the image or the document:
TVisionMLFeature = (vmlUnspecific, vmlFaceDetection, vmlLandmarkDetection,
vmlLogoDetection, vmlLabelDetection, vmlTextDetection, vmlDocTextDetection,
vmlSafeSearchDetection, vmlImageProperties, vmlCropHints, vmlWebDetection,
vmlProductSearch, vmlObjectLocalization);
The optional parameter MaxResultsPerFeature limits the number of results for each selected feature.
The optional parameter Model allows choosing between the official (stable) model and the latest beta model version.
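As a minimal sketch of a synchronous call (assuming VisionML holds an IVisionML instance obtained from your FB4D configuration and FileName points to a TIFF image):
// Required units: System.IOUtils, System.NetEncoding
function AnnotateLocalImage(VisionML: IVisionML;
  const FileName: string): IVisionMLResponse;
var
  FileAsBase64: string;
begin
  // Read the image file and convert it into the expected Base64 notation
  FileAsBase64 := TNetEncoding.Base64.EncodeBytesToString(
    TFile.ReadAllBytes(FileName));
  // Request logo and label detection with at most 10 results per feature
  result := VisionML.AnnotateFileSynchronous(FileAsBase64, 'image/tiff',
    [vmlLogoDetection, vmlLabelDetection], 10);
end;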
As a result, both methods return an object that provides the interface IVisionMLResponse:
IVisionMLResponse = interface(IInterface)
function GetFormatedJSON: string;
function GetNoPages: integer;
function GetPageAsFormatedJSON(PageNo: integer = 0): string;
function GetError(PageNo: integer = 0): TErrorStatus;
function LabelAnnotations(PageNo: integer = 0): TAnnotationList;
function LandmarkAnnotations(PageNo: integer = 0): TAnnotationList;
function LogoAnnotations(PageNo: integer = 0): TAnnotationList;
function TextAnnotations(PageNo: integer = 0): TAnnotationList;
function FullTextAnnotations(PageNo: integer = 0): TTextAnnotation;
function ImagePropAnnotation(
PageNo: integer = 0): TImagePropertiesAnnotation;
function CropHintsAnnotation(PageNo: integer = 0): TCropHintsAnnotation;
function WebDetection(PageNo: integer = 0): TWebDetection;
function SafeSearchAnnotation(PageNo: integer = 0): TSafeSearchAnnotation;
function FaceAnnotation(PageNo: integer = 0): TFaceAnnotationList;
function LocalizedObjectAnnotation(
PageNo: integer = 0): TLocalizedObjectList;
function ProductSearchAnnotation(
PageNo: integer = 0): TProductSearchAnnotation;
function ImageAnnotationContext(
PageNo: integer = 0): TImageAnnotationContext;
end;
The first getter method GetFormatedJSON returns the JSON object from the Vision API in its original form. Since this JSON has a high nesting depth and its interpretation is complex, the IVisionMLResponse interface provides easy-to-use getter methods with records for all data structures returned by VisionML.
For PDF documents, which unlike an image may contain more than one page, the GetNoPages function returns the number of pages. In all getter methods for feature-specific results, the page number of an analyzed PDF document must be specified in the range between 0 and GetNoPages - 1.
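For a multi-page PDF, the result can be inspected page by page, as in this sketch (Response is assumed to come from an Annotate call for an 'application/pdf' file; Memo1 is a hypothetical output memo):
var
  PageNo: integer;
begin
  // Valid page numbers range from 0 to GetNoPages - 1
  for PageNo := 0 to Response.GetNoPages - 1 do
    Memo1.Lines.Add(Response.GetPageAsFormatedJSON(PageNo));
end;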
The method IVisionMLResponse.LabelAnnotations returns a list of labels when the feature vmlLabelDetection was selected for Annotate.
TEntityAnnotation = record
Id: string;
Locale: string;
Description: string;
Score, Topicality, Confidence: double;
BoundingBox: TBoundingPoly;
Locations: array of TLocationInfo;
Properties: array of TProperty;
procedure AddStrings(s: TStrings; Indent: integer);
end;
TAnnotationList = array of TEntityAnnotation;
The Id string can contain a machine-generated identifier (MID) corresponding to the entity's Google Knowledge Graph entry. Note that MID values remain unique across different languages, so you can use these values to tie together entities from different languages. To inspect MID values, refer to the Google Knowledge Graph API documentation.
The Description string contains a description of the associated label in English.
The Score number is the confidence value, which ranges from 0 (no confidence) to 1 (very high confidence). This number is a measure of the detection quality: in simple terms, it states how sure the machine learning engine is that this label is correct.
The Topicality value is a measure for the relevancy of the label within the entire image. It measures how important or how central a label is to the overall context of a page.
The method AddStrings writes a textual representation of the annotation into the passed string list, indented by Indent. All other values of TEntityAnnotation are empty for label detection and are used for other features.
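As a sketch, the detected labels could be listed like this (Response is assumed to come from an Annotate call with vmlLabelDetection; Memo1 is a hypothetical output memo):
var
  Annotation: TEntityAnnotation;
begin
  for Annotation in Response.LabelAnnotations do
  begin
    Memo1.Lines.Add(Format('%s (score %.2f)',
      [Annotation.Description, Annotation.Score]));
    // Alternatively, let the record write its details into the string list
    Annotation.AddStrings(Memo1.Lines, 2);
  end;
end;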
The Vision API can detect and extract multiple objects in an image. For this purpose use the feature vmlObjectLocalization and call the method IVisionMLResponse.LocalizedObjectAnnotation to receive a list of the following records:
TLocalizedObjectAnnotation = record
Id: string;
LanguageCode: string;
Name: string;
Score: double;
BoundingPoly: TBoundingPoly;
procedure AddStrings(s: TStrings; Indent: integer);
end;
Object localization identifies multiple objects in an image and provides localized object annotations for each object in the image. Each LocalizedObjectAnnotation identifies information about the object, the position of the object, and rectangular bounds for the region of the image that contains the object.
The Name describes the object in English. The Id contains a machine-generated identifier (MID) corresponding to the entity's Google Knowledge Graph entry.
TVertex = record
x: integer;
y: integer;
end;
TNormalizedVertex = record
x: double;
y: double;
end;
TBoundingPoly = record
Vertices: array of TVertex;
NormalizedVertices: array of TNormalizedVertex;
end;
For this call, the BoundingPoly contains normalized vertices only. The coordinates are therefore normalized (0..1) to the width and height of the image.
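A sketch of how the normalized vertices could be mapped back to pixel coordinates (assuming TLocalizedObjectList is an array of TLocalizedObjectAnnotation, and ImgWidth/ImgHeight are the dimensions of the analyzed image):
var
  Obj: TLocalizedObjectAnnotation;
  NV: TNormalizedVertex;
begin
  for Obj in Response.LocalizedObjectAnnotation do
    for NV in Obj.BoundingPoly.NormalizedVertices do
      // Scale the normalized coordinates by the image dimensions
      Memo1.Lines.Add(Format('%s: x=%d, y=%d',
        [Obj.Name, round(NV.x * ImgWidth), round(NV.y * ImgHeight)]));
end;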
The Vision API can detect and extract text from images. There are two annotation features that support optical character recognition (OCR):
- The feature vmlTextDetection detects and extracts text from any image. For example, a photograph might contain a street sign or traffic sign. The JSON includes the entire extracted string, as well as individual words, and their bounding boxes.
- The feature vmlDocTextDetection also extracts text from an image, but the response is optimized for dense text and documents.
Text analysis divides the text into four levels: Block, Paragraph, Word, and Symbol (single character or glyph).
For both features, only the method IVisionMLResponse.FullTextAnnotations(PageNo: integer = 0): TTextAnnotation delivers results. During the development of the FB4D wrapper, the cloud never returned any results in IVisionMLResponse.TextAnnotations(PageNo: integer = 0): TAnnotationList.
Compared to vmlTextDetection, the feature vmlDocTextDetection returns more information within the structure TTextAnnotation: confidence values are provided down to the level of a single symbol.
TTextAnnotation = record
Text: string;
EncodedText: string;
TextPages: array of TTextPages;
function GetText(MinConfidence: double): string;
end;
The record TTextAnnotation essentially contains an array of pages, TextPages. Beside this, the whole detected text is returned by ML Vision, where FB4D provides it in the original version Text and in the encoded version EncodedText. Additionally, TTextAnnotation.GetText(MinConfidence: double) allows recursively collecting all text parts with a confidence value greater than or equal to the passed minimum value.
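As a sketch, only text recognized with at least 90% confidence could be extracted like this (Response is assumed to come from an Annotate call with vmlDocTextDetection):
var
  TextAnn: TTextAnnotation;
begin
  TextAnn := Response.FullTextAnnotations;
  // Keep only text parts with a confidence of 0.9 or higher
  Memo1.Lines.Add(TextAnn.GetText(0.9));
  // For comparison: the complete text as returned by the Vision API
  Memo1.Lines.Add(TextAnn.Text);
end;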
TTextPages = record
TextProperty: TTextProperty;
Width, Height: integer;
Blocks: array of TBlock;
Confidence: double;
end;
The record TTextPages essentially contains an array of text blocks TBlock. Beside this, it returns width and height for PDFs only. TextProperty contains information about the detected languages and their confidence values. The helper function TFirebaseHelpers.GetLanguageInEnglishFromCode converts the LanguageCode into the English name of the language. See this document for the supported languages: [cloud.google.com/vision/docs/languages](https://cloud.google.com/vision/docs/languages).
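For instance, the language detected on a page could be reported like this (a sketch; Response is assumed to come from a text detection call, and only the first detected language per page is shown):
var
  Page: TTextPages;
begin
  for Page in Response.FullTextAnnotations.TextPages do
    if length(Page.TextProperty.DetectedLanguages) > 0 then
      // Convert the language code into its English name
      Memo1.Lines.Add('Detected language: ' +
        TFirebaseHelpers.GetLanguageInEnglishFromCode(
          Page.TextProperty.DetectedLanguages[0].LanguageCode));
end;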
TTextProperty = record
DetectedLanguages: array of TDetectedLanguage;
DetectedBreakType: TDetectedBreakType;
DetectedBreakIsPrefix: boolean;
end;
TDetectedLanguage = record
LanguageCode: string;
Confidence: double;
end;
TDetectedBreakType = (dbtUnknown, dbtSpace, dbtSureSpace, dbtEOLSureSpace, dtbHyphen, dbtLineBreak);
The DetectedBreakType informs about the type of block separation:
- EolSureSpace: Line-wrapping break.
- Hyphen: End-line hyphen that is not present in text; does not co-occur with SPACE, LEADER_SPACE, or LINE_BREAK.
- LineBreak: Line break that ends a paragraph.
- Space: Regular space.
- SureSpace: Sure space (very wide).
- Unknown: Unknown break label type.
The BlockType informs about the detected block kind.
TBlockType = (btUnkown, btText, btTable, btPicture, btRuler, btBarcode);
- Barcode: Barcode block.
- Picture: Image block.
- Ruler: Horizontal/vertical line box.
- Table: Table block.
- Text: Regular text block.
- Unknown: Unknown block type.
The record TBlock essentially contains an array of paragraphs TParagraph.
TBlock = record
TextProperty: TTextProperty;
BoundingBox: TBoundingPoly;
Paragraphs: array of TParagraph;
BlockType: TBlockType;
Confidence: double;
end;
The record TParagraph essentially contains an array of words TWord.
TParagraph = record
TextProperty: TTextProperty;
BoundingBox: TBoundingPoly;
Words: array of TWord;
Confidence: double;
end;
The record TWord essentially contains an array of symbols TSymbols.
TWord = record
TextProperty: TTextProperty;
BoundingBox: TBoundingPoly;
Symbols: array of TSymbols;
Confidence: double;
end;
The record TSymbols corresponds to the bottom level of the text analysis. The character/glyph is stored as a string in Text.
TSymbols = record
TextProperty: TTextProperty;
BoundingBox: TBoundingPoly;
Text: string;
Confidence: double;
end;
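The following sketch shows how the four levels can be traversed to rebuild the text word by word while skipping words below a minimum confidence (a minimal example based on the records above):
function CollectWords(const TextAnn: TTextAnnotation;
  MinConfidence: double): string;
var
  Page: TTextPages;
  Block: TBlock;
  Par: TParagraph;
  Wrd: TWord;
  Sym: TSymbols;
begin
  result := '';
  // Walk the hierarchy: page -> block -> paragraph -> word -> symbol
  for Page in TextAnn.TextPages do
    for Block in Page.Blocks do
      for Par in Block.Paragraphs do
        for Wrd in Par.Words do
          if Wrd.Confidence >= MinConfidence then
          begin
            // Concatenate the glyphs that form the word
            for Sym in Wrd.Symbols do
              result := result + Sym.Text;
            result := result + ' ';
          end;
end;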
With the feature vmlLandmarkDetection, VisionML detects popular natural and human-made structures within an image.
The function IVisionMLResponse.LandmarkAnnotations(PageNo: integer = 0): TAnnotationList returns a list of landmarks, where in the resulting record TEntityAnnotation the field Locations contains an array of TLocationInfo:
TLocationInfo = record
Latitude, Longitude: Extended;
end;
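As a sketch, detected landmarks and their coordinates could be listed like this (Response is assumed to come from an Annotate call with vmlLandmarkDetection; Memo1 is a hypothetical output memo):
var
  Landmark: TEntityAnnotation;
  Loc: TLocationInfo;
begin
  for Landmark in Response.LandmarkAnnotations do
    for Loc in Landmark.Locations do
      Memo1.Lines.Add(Format('%s at %2.6f, %2.6f',
        [Landmark.Description, Loc.Latitude, Loc.Longitude]));
end;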
With the feature vmlLogoDetection, VisionML detects popular logos of widely used brands. The function IVisionMLResponse.LogoAnnotations returns a list of TEntityAnnotation.
With the feature vmlFaceDetection, VisionML recognizes all faces of persons in the image. Thereby the faces are measured and simple emotions are checked. It is important to note that this function does not yet enable the identification of persons. It also does not return estimated values for age and gender.
The function IVisionMLResponse.FaceAnnotation returns a list of TFaceAnnotation.
TFaceAnnotation = record
BoundingPoly: TBoundingPoly;
FaceDetectionBP: TBoundingPoly;
FaceLandmarks: array of TFaceLandmark;
RollAngle: double; // -180..180
PanAngle: double; // -180..180
TiltAngle: double; // -180..180
DetectionConfidence: double; // 0..1
LandmarkingConfidence: double; // 0..1
JoyLikelihood: TLikelihood;
SorrowLikelihood: TLikelihood;
AngerLikelihood: TLikelihood;
SurpriseLikelihood: TLikelihood;
UnderExposedLikelihood: TLikelihood;
BlurredLikelihood: TLikelihood;
HeadwearLikelihood: TLikelihood;
end;
TLikelihood = (lhUnknown, lhVeryUnlikely, lhUnlikely, lhPossible, lhLikely, lhVeryLikely);
TFaceLandmark = record
FaceLandmarkType: TFaceLandmarkType;
Position: TFacePosition;
end;
TFaceLandmarkType = (flmUnkown, flmLeftEye, flmRightEye,
flmLeftOfLeftEyeBrow, flmRightOfLeftEyeBrow,
flmLeftOfRightEyeBrow, flmRightOfRightEyeBrow,
flmMidpointBetweenEyes, flmNoseTip, flmUpperLip, flmLowerLip,
flmMouthLeft, flmMouthRight, flmMouthCenter,
flmNoseBottomRight, flmNoseBottomLeft, flmNoseBottomCenter,
flmLeftEyeTopBoundary, flmLeftEyeRightCorner,
flmLeftEyeBottomBoundary, flmLeftEyeLeftCorner,
flmRightEyeTopBoundary, flmRightEyeRightCorner,
flmRightEyeBottomBoundary, flmRightEyeLeftCorner,
flmLeftEyebrowUpperMidpoint, flmRightEyebrowUpperMidpoint,
flmLeftEarTragion, flmRightEarTragion,
flmLeftEyePupil, flmRightEyePupil,
flmForeheadGlabella, flmChinGnathion,
flmChinLeftGonion, flmChinRightGonion,
flmLeftCheekCenter, flmRightCheekCenter);
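As a sketch, simple emotions could be evaluated like this (assuming TFaceAnnotationList is an array of TFaceAnnotation and Memo1 is a hypothetical output memo):
var
  Face: TFaceAnnotation;
begin
  for Face in Response.FaceAnnotation do
    // Check the estimated likelihood of joy for each detected face
    if Face.JoyLikelihood in [lhLikely, lhVeryLikely] then
      Memo1.Lines.Add(Format('Happy face detected (confidence %.2f)',
        [Face.DetectionConfidence]));
end;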
With the feature vmlWebDetection, VisionML searches the web for matching and visually similar images as well as pages that contain them. The function IVisionMLResponse.WebDetection returns the following record:
TWebDetection = record
WebEntities: array of TWebEntity;
FullMatchingImages: array of TWebImage;
PartialMatchingImages: array of TWebImage;
PagesWithMatchingImages: array of TWebPage;
VisuallySimilarImages: array of TWebImage;
BestGuessLabels: array of TWebLabel;
end;
TWebPage = record
URL: string;
Score: Double;
PageTitle: string;
FullMatchingImages: array of TWebImage;
PartialMatchingImages: array of TWebImage;
end;
TWebLabel = record
LabelText: string;
LanguageCode: string;
end;
TWebImage = record
URL: string;
Score: Double;
end;
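A sketch of how such a web detection result could be evaluated (Response is assumed to come from an Annotate call with vmlWebDetection; Memo1 is a hypothetical output memo):
var
  WD: TWebDetection;
  Lbl: TWebLabel;
  Page: TWebPage;
begin
  WD := Response.WebDetection;
  for Lbl in WD.BestGuessLabels do
    Memo1.Lines.Add('Best guess: ' + Lbl.LabelText);
  for Page in WD.PagesWithMatchingImages do
    Memo1.Lines.Add(Format('Matching page: %s (score %.2f)',
      [Page.URL, Page.Score]));
end;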
With the feature vmlSafeSearchDetection, VisionML rates the image with regard to adult, spoof, medical, violent, and racy content. The function IVisionMLResponse.SafeSearchAnnotation returns the following likelihood ratings:
TSafeSearchAnnotation = record
AdultContent: TLikelihood;
Spoof: TLikelihood;
MedicalImage: TLikelihood;
ViolentContent: TLikelihood;
RacyContent: TLikelihood;
end;
With the feature vmlProductSearch, VisionML searches a previously configured product set in the cloud for products that are similar to those in the image. The function IVisionMLResponse.ProductSearchAnnotation returns:
TProductSearchAnnotation = record
IndexTime: TDateTime;
ProductResults: array of TProductResult;
GroupedProductResults: array of TGroupedProductResult;
end;
TProductResult = record
Product: TProduct;
Score: double;
Image: string;
end;
TKeyValue = record
Key: string;
Value: string;
end;
TProduct = record
Name: string;
DisplayName: string;
Description: string;
ProductCategory: string;
ProductLabels: array of TKeyValue;
end;
TGroupedProductResult = record
BoundingPoly: TBoundingPoly;
ProductResults: array of TProductResult;
ObjectAnnotations: array of TObjectAnnotation;
end;
TObjectAnnotation = record
Id: string;
LanguageCode: string;
Name: string;
Score: double;
end;
If the analyzed image was extracted from a file such as a PDF, the function IVisionMLResponse.ImageAnnotationContext returns the source URI and page number:
TImageAnnotationContext = record
URI: string;
PageNumber: integer;
end;
With the feature vmlImageProperties, VisionML determines general attributes of the image, such as its dominant colors. The function IVisionMLResponse.ImagePropAnnotation returns:
TImagePropertiesAnnotation = record
DominantColors: array of TColorInfo;
end;
With the feature vmlCropHints, VisionML suggests vertices for a crop region on the image. The function IVisionMLResponse.CropHintsAnnotation returns:
TCropHintsAnnotation = record
CropHints: array of TCropHint;
end;
TCropHint = record
BoundingPoly: TBoundingPoly;
Confidence: double;
ImportanceFraction: double;
end;