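VGSL Specs - rapid prototyping of mixed conv/LSTM networks for images

Variable-size Graph Specification Language (VGSL) lets you specify a neural network, composed of convolutions and LSTMs, that can process variable-sized images, from a very short definition string.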

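Applications: What is VGSL Specs good for?

VGSL Specs are designed specifically to create networks for:

Variable-size images as the input. (In one or both dimensions!)
Output of an image (heat map), a sequence (like text), or a category.
Convolutions and LSTMs as the main computing components.
Fixed-size images are OK too!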

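Model string input and output

A neural network model is described by a string that gives the input specification, the output specification, and the layer specifications in between. Example: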

[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

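The first 4 numbers specify the size and type of the input, and follow the TensorFlow convention for an image tensor: [batch, height, width, depth]. Batch is currently ignored, but may eventually be used to indicate a training mini-batch size. Height and/or width may be zero, allowing them to be variable. A non-zero value for height and/or width means that all input images are expected to be of that size, and will be warped to fit if needed. Depth needs to be 1 for greyscale and 3 for color. As a special case, a different value of depth combined with a height of 1 causes the image to be treated, from input, as a sequence of vertical pixel strips.

NOTE that throughout, x and y are reversed from conventional mathematics, to use the same convention as TensorFlow. TF adopts this convention to eliminate the need to transpose images on input: adjacent memory locations in images increase x first and then y, while adjacent memory locations in TF tensors, and in NetworkIO in Tesseract, increase the rightmost index first and then the next one to the left, like C arrays.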
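The input rules above can be sketched as a small parser. This is only an illustration: `InputSpec`, `parse_input_spec`, and `describe` are hypothetical names, not part of Tesseract or TensorFlow, and the sketch assumes the spec string starts with `[` followed by the four comma-separated integers.

```python
from typing import NamedTuple


class InputSpec(NamedTuple):
    """Hypothetical container for the leading 'batch,height,width,depth' word."""
    batch: int
    height: int  # 0 means variable height
    width: int   # 0 means variable width
    depth: int


def parse_input_spec(spec: str) -> InputSpec:
    """Parse the first word of a VGSL string into its four integers."""
    first_word = spec.strip().lstrip("[").split()[0]
    parts = first_word.split(",")
    if len(parts) != 4:
        raise ValueError(f"expected 4 comma-separated integers, got {first_word!r}")
    return InputSpec(*(int(p) for p in parts))


def describe(inp: InputSpec) -> str:
    """Summarize how the input would be interpreted under the rules in the text."""
    # Special case from the text: height 1 with a depth other than 1 or 3
    # means the image is treated as a sequence of vertical pixel strips.
    if inp.height == 1 and inp.depth not in (1, 3):
        return f"sequence of vertical pixel strips, {inp.depth} pixels tall"
    colour = {1: "greyscale", 3: "colour"}.get(inp.depth, f"depth {inp.depth}")
    height = "variable" if inp.height == 0 else str(inp.height)
    width = "variable" if inp.width == 0 else str(inp.width)
    return f"{colour} image, height {height}, width {width}"


if __name__ == "__main__":
    spec = "[1,0,0,3 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"
    print(describe(parse_input_spec(spec)))
```

For the example string above, the sketch reports a colour image with variable height and width.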

The last "word" is the output specification and takes the form:

O(2|1|0)(l|s|c)n output layer with n classes.
  2 (heatmap) Output is a 2-d vector map of the input (possibly at
    different scale). (Not yet supported.)
  1 (sequence) Output is a 1-d sequence of vector values.
  0 (category) Output is a 0-d single vector value.
  l uses a logistic non-linearity on the output, allowing multiple
    hot elements in any output vector value. (Not yet supported.)
  s uses a softmax non-linearity, with one-hot output in each value.
  c uses a softmax with CTC. Can only be used with 1 (sequence).
  NOTE Only O1s and O1c are currently supported.

The number of classes is ignored (it is only there for compatibility with TensorFlow), as the actual number is taken from the unicharset.
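As an illustration of the output grammar, the following sketch decodes an output word with a regular expression. The function name `parse_output_spec` and the returned dictionary layout are assumptions made for this example; they are not an API of Tesseract.

```python
import re

# O(2|1|0)(l|s|c)n, as given in the grammar above.
OUTPUT_RE = re.compile(r"^O([210])([lsc])(\d+)$")

DIMS = {"2": "heatmap", "1": "sequence", "0": "category"}
NONLINEARITY = {"l": "logistic", "s": "softmax", "c": "softmax with CTC"}


def parse_output_spec(word: str) -> dict:
    """Decode an output word such as 'O1c105' (hypothetical helper)."""
    match = OUTPUT_RE.match(word)
    if match is None:
        raise ValueError(f"not a valid output spec: {word!r}")
    dims, kind, classes = match.groups()
    if kind == "c" and dims != "1":
        raise ValueError("CTC can only be used with a 1-d (sequence) output")
    return {
        "dims": DIMS[dims],
        "nonlinearity": NONLINEARITY[kind],
        # Parsed for completeness; the text says the real class count
        # comes from the unicharset, so this value is only nominal.
        "classes": int(classes),
        # Per the NOTE in the grammar, only O1s and O1c are supported.
        "supported": dims == "1" and kind in ("s", "c"),
    }


if __name__ == "__main__":
    print(parse_output_spec("O1c105"))
```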

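Syntax of the layers in between

NOTE that all ops input and output the standard TF convention of a 4-d tensor: [batch, height, width, depth], regardless of any collapsing of dimensions. This greatly simplifies things, and allows the VGSLSpecs class to track changes to the values of width and height, so they can be correctly passed in to LSTM operations and used by any downstream CTC operation.

NOTE: in the descriptions below, <d> is a numeric value, and literals are described using regular expression syntax.

NOTE: Whitespace is allowed between ops.

Functional ops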

C(s|t|r|l|m)<y>,<x>,<d> Convolves using a y,x window, with no shrinkage,
  random infill, d outputs, with s|t|r|l|m non-linear layer.
F(s|t|r|l|m)<d> Fully-connected with s|t|r|l|m non-linearity and d outputs.
  Connects to every y,x,depth position of the input, reducing height and
  width to 1 and producing a single <d> vector as the output.
  Input height and width *must* be constant.
  For a sliding-window linear or non-linear map that connects just to the
  input depth, and leaves the input image size as-is, use a 1x1 convolution,
  e.g. Cr1,1,64 instead of Fr64.
L(f|r|b)(x|y)[s]<n> LSTM cell with n outputs.
  The LSTM must have one of:
    f runs the LSTM forward only.
    r runs the LSTM reversed only.
    b runs the LSTM bidirectionally.
  It will operate on either the x- or y-dimension, treating the other dimension
  independently (as if part of the batch).
  s (optional) summarizes the output in the requested dimension, outputting
    only the final step, collapsing the dimension to a single element.
LS<n> Forward-only LSTM cell in the x-direction, with built-in Softmax.
LE<n> Forward-only LSTM cell in the x-direction, with built-in softmax,
  with binary Encoding.

In the above, (s|t|r|l|m) specifies the type of the non-linearity:

s = sigmoid
t = tanh
r = relu
l = linear (i.e., No non-linearity)
m = softmax

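Examples

Cr5,5,32 runs a 5x5 Relu convolution with depth/number of filters 32.

Lfx128 runs a forward-only LSTM in the x-dimension with 128 outputs, treating the y-dimension independently.

Lfys64 runs a forward-only LSTM in the y-dimension with 64 outputs, treating the x-dimension independently, and collapses the y-dimension to 1 element.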
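The three examples above can be reproduced mechanically from the grammar. The sketch below covers only the C, F, and L(f|r|b)(x|y)[s] forms; `describe_op` is a hypothetical helper written for this illustration, and the wording of its output is not taken from any real tool.

```python
import re

NONLINEARITY = {"s": "sigmoid", "t": "tanh", "r": "relu", "l": "linear", "m": "softmax"}
DIRECTION = {"f": "forward", "r": "reverse", "b": "bidirectional"}

CONV_RE = re.compile(r"^C([strlm])(\d+),(\d+),(\d+)$")
FULL_RE = re.compile(r"^F([strlm])(\d+)$")
LSTM_RE = re.compile(r"^L([frb])([xy])(s?)(\d+)$")


def describe_op(word: str) -> str:
    """Turn a functional-op word into a short description (illustrative only)."""
    match = CONV_RE.match(word)
    if match:
        nonlin, y, x, depth = match.groups()
        return f"{y}x{x} {NONLINEARITY[nonlin]} convolution, {depth} outputs"
    match = FULL_RE.match(word)
    if match:
        nonlin, depth = match.groups()
        return f"fully connected {NONLINEARITY[nonlin]} layer, {depth} outputs"
    match = LSTM_RE.match(word)
    if match:
        direction, dim, summarize, outputs = match.groups()
        text = f"{DIRECTION[direction]} LSTM over {dim}, {outputs} outputs"
        if summarize:
            # The optional 's' collapses the chosen dimension to one element.
            text += f", summarizing {dim}"
        return text
    raise ValueError(f"unrecognized functional op: {word!r}")


if __name__ == "__main__":
    for op in ("Cr5,5,32", "Lfx128", "Lfys64"):
        print(op, "->", describe_op(op))
```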

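Plumbing ops

The plumbing ops allow the construction of arbitrarily complex graphs. Something currently missing is the ability to define macros, for example to generate an inception unit in multiple places.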

[...] Execute ... networks in series (layers).
(...) Execute ... networks in parallel, with their output concatenated in depth.
S<y>,<x> Rescale 2-D input by shrink factor y,x, rearranging the data by
  increasing the depth of the input by factor xy.
  **NOTE** that the TF implementation of VGSLSpecs has a different S that is
  not yet implemented in Tesseract.
Mp<y>,<x> Maxpool the input, reducing each (y,x) rectangle to a single value.
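The effect of the plumbing ops on tensor shape can be sketched as follows. Several points here are assumptions rather than statements from the spec: shapes are written as (height, width, depth) with `None` standing for a variable dimension, sizes that do not divide evenly are rounded down, and parallel branches are required to share height and width. All function names are hypothetical.

```python
from typing import List, Optional, Tuple

Shape = Tuple[Optional[int], Optional[int], int]  # (height, width, depth)


def _shrink(size: Optional[int], factor: int) -> Optional[int]:
    """Divide a known size by factor; a variable (None) size stays variable."""
    # Assumption: floor division; the spec does not say how remainders are handled.
    return None if size is None else size // factor


def rescale_shape(shape: Shape, y: int, x: int) -> Shape:
    """S<y>,<x>: shrink by y,x and grow the depth by the factor x*y."""
    height, width, depth = shape
    return (_shrink(height, y), _shrink(width, x), depth * x * y)


def maxpool_shape(shape: Shape, y: int, x: int) -> Shape:
    """Mp<y>,<x>: each (y,x) rectangle becomes one value; depth is unchanged."""
    height, width, depth = shape
    return (_shrink(height, y), _shrink(width, x), depth)


def parallel_shape(shapes: List[Shape]) -> Shape:
    """(...): outputs of parallel branches are concatenated in depth."""
    height, width, _ = shapes[0]
    if any(s[:2] != (height, width) for s in shapes):
        raise ValueError("assumed: parallel branches must agree on height and width")
    return (height, width, sum(s[2] for s in shapes))


if __name__ == "__main__":
    print(rescale_shape((48, 120, 1), 2, 2))
    print(maxpool_shape((48, None, 16), 3, 3))
    print(parallel_shape([(1, None, 128), (1, None, 128)]))
```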

Full example: A 1-D LSTM capable of high quality OCR

[1,1,0,48 Lbx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lbx256: Bi-directional LSTM in x with 256 outputs
1,1,0,48: Input is a batch of 1 image of height 48 pixels in greyscale, treated
  as a 1-dimensional sequence of vertical pixel strips.
[]: The network is always expressed as a series of layers.

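This network works well for OCR, as long as the input image is carefully normalized in the vertical direction, with the baseline and meanline in constant places.

Full example: A multi-layer LSTM capable of high quality OCR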

[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]

As layer descriptions: (Input layer is at the bottom, output at the top.)

O1c105: Output layer produces 1-d (sequence) output, trained with CTC,
  outputting 105 classes.
Lfx256: Forward-only LSTM in x with 256 outputs
Lrx128: Reverse-only LSTM in x with 128 outputs
Lfx128: Forward-only LSTM in x with 128 outputs
Lfys64: Dimension-summarizing LSTM, summarizing the y-dimension with 64 outputs
Mp3,3: 3 x 3 Maxpool
Ct5,5,16: 5 x 5 Convolution with 16 outputs and tanh non-linearity
1,0,0,1: Input is a batch of 1 image of variable size in greyscale
[]: The network is always expressed as a series of layers.

The summarizing LSTM makes this network more resilient to vertical variation in the position of the text.
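To see where the y-dimension collapses in the multi-layer example, the layers can be traced shape by shape. This is a rough sketch under stated assumptions: `trace_shapes` is a hypothetical helper, variable dimensions are written as `None`, Mp rounds down, only forward and reverse LSTMs are handled (the text does not state the output depth of a bidirectional LSTM), and the output depth simply echoes the class count in the string even though the real count comes from the unicharset.

```python
import re


def trace_shapes(spec: str):
    """Return [(word, (height, width, depth)), ...] for a simple serial spec."""
    words = spec.strip().strip("[]").split()
    _batch, height, width, depth = (int(v) for v in words[0].split(","))
    # A size of 0 in the input spec means "variable"; represent it as None.
    shape = (height or None, width or None, depth)
    trace = [(words[0], shape)]
    for word in words[1:]:
        height, width, depth = shape
        if m := re.fullmatch(r"C[strlm](\d+),(\d+),(\d+)", word):
            # Convolution: no shrinkage, depth becomes the number of outputs.
            shape = (height, width, int(m.group(3)))
        elif m := re.fullmatch(r"Mp(\d+),(\d+)", word):
            y, x = int(m.group(1)), int(m.group(2))
            shape = (height // y if height else None,
                     width // x if width else None,
                     depth)
        elif m := re.fullmatch(r"L[fr]([xy])(s?)(\d+)", word):
            dim, summarize, outputs = m.groups()
            if summarize:
                # A summarizing LSTM collapses its own dimension to 1 element.
                if dim == "y":
                    height = 1
                else:
                    width = 1
            shape = (height, width, int(outputs))
        elif m := re.fullmatch(r"O[012][lsc](\d+)", word):
            shape = (height, width, int(m.group(1)))  # nominal class count
        else:
            raise ValueError(f"op not handled by this sketch: {word!r}")
        trace.append((word, shape))
    return trace


if __name__ == "__main__":
    spec = "[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"
    for word, shape in trace_shapes(spec):
        print(f"{word:10s} {shape}")
```

In the trace, the height stays variable until Lfys64, after which every layer sees a height of 1 and a variable width.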

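Variable size inputs and summarizing LSTM

NOTE that currently the only way of collapsing a dimension of unknown size to a known size (1) is through the use of a summarizing LSTM. A single summarizing LSTM collapses one dimension (x or y), leaving a 1-d sequence. The 1-d sequence can then be collapsed in the other dimension to make a 0-d categorical (softmax) or embedding (logistic) output.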

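For OCR purposes, then, there are two options. Either the height of the input images is fixed and scaled vertically to 1 by the top layer (using Mp or S), or, to allow variable-height images, a summarizing LSTM is used to collapse the vertical dimension to a single value. The summarizing LSTM can also be used with a fixed-height input.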
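The rule above can be turned into a quick sanity check on a spec string. This is a simplified sketch: `height_strategy` is a hypothetical helper, it only inspects the declared input height and looks for a y-summarizing LSTM word, and it does not verify that Mp or S layers actually reduce a fixed height to 1.

```python
import re


def height_strategy(spec: str) -> str:
    """Classify how a serial VGSL spec deals with the vertical dimension."""
    words = spec.strip().strip("[]").split()
    declared_height = int(words[0].split(",")[1])
    # L(f|r|b)ys<n> is an LSTM that summarizes (collapses) the y-dimension.
    has_y_summary = any(re.fullmatch(r"L[frb]ys\d+", w) for w in words[1:])
    if has_y_summary:
        return "summarizing LSTM"
    if declared_height == 0:
        # Variable height with nothing to collapse it: not usable for OCR
        # according to the rule in the text.
        return "invalid: variable height without a summarizing LSTM"
    return "fixed height"


if __name__ == "__main__":
    print(height_strategy("[1,0,0,1 Ct5,5,16 Mp3,3 Lfys64 Lfx128 Lrx128 Lfx256 O1c105]"))
    print(height_strategy("[1,1,0,48 Lbx256 O1c105]"))
```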