Command Line Usage
Tesseract 'man' page
See the man page for command line syntax and other details.
FAQ
See the FAQ for more examples and tips.
OCR engines available in Tesseract 5
Use --oem 1 for the LSTM/neural network engine and --oem 0 for Legacy Tesseract.
Note that Legacy Tesseract models are only included in traineddata files from the tessdata repository.
tesseract input.tiff output --oem 1 -l eng
Simplest invocation to OCR an image
tesseract imagename outputbase
This uses English as the default language and 3 as the page segmentation mode. The default output format is text.
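The osd.traineddata file for orientation and script detection, eng.traineddata for English, and any other language data files should be in the "tessdata" directory. In Tesseract 4 and later, the TESSDATA_PREFIX environment variable should point to that "tessdata" directory itself (3.x releases expected its parent directory).
If the eng.traineddata and osd.traineddata files are in the /usr/share/tessdata directory, the following command gives the same result as the one above.
tesseract --tessdata-dir /usr/share/tessdata imagename outputbase -l eng --psm 3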
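Before running a command like the ones above, it can help to confirm that the language data files are where Tesseract will look for them. The following is a minimal sketch, not part of Tesseract: has_traineddata is a hypothetical helper name, and the directory passed to it is whatever "tessdata" directory your installation uses.

```shell
# Hypothetical helper (not part of Tesseract): succeed only if every
# listed language has a <lang>.traineddata file in the given directory.
has_traineddata() {
  dir=$1
  shift
  for lang in "$@"; do
    if [ ! -f "$dir/$lang.traineddata" ]; then
      echo "missing: $dir/$lang.traineddata" >&2
      return 1
    fi
  done
  return 0
}
```

For example, `has_traineddata /usr/share/tessdata eng osd` returns success only when both files exist, so it can guard a later tesseract call with `&&`.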
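The following examples use this image, which contains text in multiple languages.
Using one language
Add -l LANG to the command, where LANG is the three-character language code from the list of supported languages. If no language is specified, English is assumed.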
tesseract images/eurotext.png - -l eng
Output
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der ,schnelle” braune Fuchs springt
iiber den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén rapido salta sobre el perro
perezoso. A raposa marrom ripida
salta sobre o cdo preguigoso.
Using multiple languages
Add -l LANG[+LANG] to the command line to use multiple languages together for recognition.
tesseract images/eurotext.png - -l eng+deu
Output
The (quick) [brown] {fox} jumps!
Over the $43,456.78 <lazy> #90 dog
& duck/goose, as 12.5% of E-mail
from aspammer@website.com is spam.
Der „schnelle” braune Fuchs springt
über den faulen Hund. Le renard brun
«rapide» saute par-dessus le chien
paresseux. La volpe marrone rapida
salta sopra il cane pigro. El zorro
marrén rapido salta sobre el perro
perezoso. A raposa marrom räpida
salta sobre o cdo preguigoso.
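When the language list comes from a script rather than being typed by hand, the LANG[+LANG] argument can be assembled from separate codes. This is a small sketch; join_langs is a hypothetical helper name, not a Tesseract feature.

```shell
# Hypothetical helper: join language codes with '+' for the -l option.
# The subshell keeps the IFS change from leaking into the caller.
join_langs() {
  ( IFS=+; printf '%s\n' "$*" )
}
```

With this in place, `tesseract images/eurotext.png - -l "$(join_langs eng deu)"` is equivalent to passing -l eng+deu directly. The first code given becomes the primary language.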
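Order of multiple languages
Both the time taken for OCR and the output can vary with the order of the languages.
The following examples use this image, which contains text in two languages, Hindi and English.
Using English as the primary language, followed by Hindi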
time tesseract images/bilingual.png - -l eng+hin
Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH
real 0m0.442s
user 0m0.622s
sys 0m0.062s
Using Hindi as the primary language, followed by English
time tesseract images/bilingual.png - -l hin+eng
Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH
real 0m0.429s
user 0m0.550s
sys 0m0.074s
Using script/Devanagari as the primary language (it supports all languages written in the Devanagari script, plus English)
time tesseract images/bilingual.png - -l script/Devanagari
Estimating resolution as 638
हिंदी से अंग्रेजी
HINDI TO
ENGLISH
real 0m0.391s
user 0m0.459s
sys 0m0.093s
Using the quiet config to suppress messages
Adding quiet to the end of the above command suppresses the message about image resolution.
time tesseract images/bilingual.png - -l script/Devanagari quiet
हिंदी से अंग्रेजी
HINDI TO
ENGLISH
real 0m0.416s
user 0m0.494s
sys 0m0.091s
Searchable PDF output
tesseract testing/eurotext.png testing/eurotext-eng -l eng pdf
This creates a PDF that contains the image and a separate searchable text layer with the recognized text.
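To produce searchable PDFs for several images, the outputbase argument can be derived from each image name. The sketch below assumes English language data is installed; outputbase_for and ocr_to_pdf_all are hypothetical helper names, and the second function is only defined here, not run.

```shell
# Hypothetical helper: strip the last extension to get an outputbase,
# e.g. testing/eurotext.png -> testing/eurotext
outputbase_for() {
  printf '%s\n' "${1%.*}"
}

# Hypothetical batch wrapper: writes <outputbase>.pdf next to each image.
# Requires tesseract and eng.traineddata; it is not invoked by this block.
ocr_to_pdf_all() {
  for image in "$@"; do
    tesseract "$image" "$(outputbase_for "$image")" -l eng pdf
  done
}
```

Tesseract appends the .pdf extension itself, which is why the helper removes the image extension rather than replacing it. File names without an extension are not handled specially in this sketch.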
HOCR output
Add hocr to the end of the command to use the 'hocr' config file and get HOCR output.
tesseract images/eurotext.png - -l eng hocr
Partial output
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title></title>
<meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
<meta name='ocr-system' content='tesseract 5.0.1-64-g3c22' />
<meta name='ocr-capabilities' content='ocr_page ocr_carea ocr_par ocr_line ocrx_word ocrp_wconf'/>
</head>
<body>
<div class='ocr_page' id='page_1' title='image "images/eurotext.png"; bbox 0 0 640 500; ppageno 0; scan_res 300 300'>
<div class='ocr_carea' id='block_1_1' title="bbox 61 41 574 413">
<p class='ocr_par' id='par_1_1' lang='eng' title="bbox 61 41 574 413">
<span class='ocr_line' id='line_1_1' title="bbox 65 41 515 71; baseline 0.013 -11; x_size 25; x_descenders 5; x_ascenders 6">
<span class='ocrx_word' id='word_1_1' title='bbox 65 41 111 61; x_wconf 96'>The</span>
<span class='ocrx_word' id='word_1_2' title='bbox 128 42 217 66; x_wconf 95'>(quick)</span>
<span class='ocrx_word' id='word_1_3' title='bbox 235 43 330 68; x_wconf 95'>[brown]</span>
<span class='ocrx_word' id='word_1_4' title='bbox 349 44 415 69; x_wconf 94'>{fox}</span>
<span class='ocrx_word' id='word_1_5' title='bbox 429 45 515 71; x_wconf 96'>jumps!</span>
</span>
...
<span class='ocr_line' id='line_1_12' title="bbox 61 385 444 413; baseline 0.013 -9; x_size 24; x_descenders 4; x_ascenders 5">
<span class='ocrx_word' id='word_1_62' title='bbox 61 385 119 405; x_wconf 92'>salta</span>
<span class='ocrx_word' id='word_1_63' title='bbox 135 385 200 406; x_wconf 92'>sobre</span>
<span class='ocrx_word' id='word_1_64' title='bbox 216 392 229 406; x_wconf 83'>o</span>
<span class='ocrx_word' id='word_1_65' title='bbox 244 388 285 407; x_wconf 80'>cdo</span>
<span class='ocrx_word' id='word_1_66' title='bbox 300 388 444 413; x_wconf 92'>preguigoso.</span>
</span>
</p>
</div>
</div>
</body>
</html>
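The word-level spans in this output carry both the recognized text and an x_wconf confidence value. As a rough sketch, they can be pulled out with sed, assuming one ocrx_word span per line as in the sample above; hocr_words is a hypothetical helper name. A real HTML or XML parser is more robust for anything beyond a quick look.

```shell
# Hypothetical helper: read hOCR on stdin and print "confidence word"
# for each line containing an ocrx_word span. Assumes one span per line
# and single-quoted attributes, as in the sample output.
hocr_words() {
  sed -n "s|.*class='ocrx_word'.*x_wconf \([0-9][0-9]*\)'>\([^<]*\)</span>.*|\1 \2|p"
}
```

A possible use is `tesseract images/eurotext.png - -l eng hocr | hocr_words`. Characters such as < and & appear in hOCR as HTML entities and are printed in that escaped form.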
TSV output
Add tsv to the end of the command to use the 'tsv' config file and get TSV output.
tesseract images/eurotext.png - -l eng tsv
Partial output
level page_num block_num par_num line_num word_num left top width height conf text
1 1 0 0 0 0 0 0 640 500 -1
2 1 1 0 0 0 61 41 513 372 -1
3 1 1 1 0 0 61 41 513 372 -1
4 1 1 1 1 0 65 41 450 30 -1
5 1 1 1 1 1 65 41 46 20 96.063751 The
5 1 1 1 1 2 128 42 89 24 95.965691 (quick)
5 1 1 1 1 3 235 43 95 25 95.835831 [brown]
5 1 1 1 1 4 349 44 66 25 94.899742 {fox}
5 1 1 1 1 5 429 45 86 26 96.683357 jumps!
4 1 1 1 2 0 65 72 490 31 -1
5 1 1 1 2 1 65 72 60 20 96.912064 Over
5 1 1 1 2 2 140 73 37 20 96.887390 the
5 1 1 1 2 3 194 73 139 24 93.263031 $43,456.78
5 1 1 1 2 4 350 76 85 25 90.893219 <lazy>
5 1 1 1 2 5 451 77 44 19 96.820717 #90
5 1 1 1 2 6 511 78 44 25 96.538940 dog
4 1 1 1 3 0 64 103 458 26 -1
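In this output, rows with level 5 are individual words, with the confidence in the eleventh column and the text in the twelfth. A minimal sketch for listing words below a chosen confidence is shown here; low_conf_words is a hypothetical helper name, and it assumes the columns are separated by tab characters, as in Tesseract's actual TSV files.

```shell
# Hypothetical helper: read Tesseract TSV on stdin and print
# "conf<TAB>text" for word rows (level 5) whose confidence is below
# the threshold given as $1. The header row is skipped.
low_conf_words() {
  awk -F '\t' -v min="$1" \
    'NR > 1 && $1 == 5 && $11 + 0 < min + 0 { print $11 "\t" $12 }'
}
```

For instance, `tesseract images/eurotext.png - -l eng tsv | low_conf_words 95` lists the words that may be worth checking by hand. The threshold is a judgment call and depends on the image quality.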
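Using different page segmentation modes
--psm 3 - Fully automatic page segmentation, but no OSD. (Default)
The following example uses this image, which contains text in multiple columns.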
tesseract images/2col.png - --psm 3
Cautionary Statement
ON FORWARD-LOOKING STATEMENTS: This presentation includes
information, statements, beliefs and opinions which are forward-looking, and
which reflect current estimates, expectations and projections about future
events, referred to herein as “forward-looking statements” within the meaning
of the U.S> Private Securities Litigation Reform Act of 1995 or “forward-looking
information” under applicable securities laws. Statements containing the words
“believe”, “expect”, “continue”, “could, “potential”, “predict”, “would”,
“intend”, “should”, “seek”, “anticipate”, ‘will’, “opportunity,” “positioned”,
“poised,” “project”, “risk”, “plan”, “may”, “estimate” or, in each case, their
Historical Information: Historical statements contained in this document
regarding past trends or activities should not be taken as a representation that
such trends or activities will continue in the future. In this regard, certain
financial information contained herein has been extracted from, or based upon,
information available in the public domain and/or provided by the Company. In
particular, historical results should not be taken as a representation that such
trends will be replicated in the future. No statement in this document is
intended to be nor may be construed as a profit forecast.
--psm 6 - Assume a single uniform block of text.
The following example uses this image, which contains a table of contents.
tesseract images/toc.png - --psm 6
Contents
Introduction to the Tenth Anniversary Edition page xvii
Afterword to the Tenth Anniversary Edition xix
Preface xxi
Acknowledgements xxvii
Nomenclature and notation xxix
Part I Fundamental concepts 1
1 Introduction and overview 1
1.1 Global perspectives 1
1.11 History of quantum computation and quantum
information 2
1.1.2 Future directions 12
1.2 Quantum bits 13
1.2.1 Multiple qubits 16
1.3 Quantum computation 17
1.3.1 Single qubit gates 17
1.3.2 Multiple qubit gates 20
1.3.3 Measurements in bases other than the computational basis 2
1.34 Quantum circuits 2
1.3.5 Qubit copying circuit? 24
Using -c preserve_interword_spaces=1 to preserve spaces
tesseract images/toc.png - --psm 6 -c preserve_interword_spaces=1
Contents
Introduction to the Tenth Anniversary Edition page xvii
Afterword to the Tenth Anniversary Edition xix
Preface xxi
Acknowledgements xxvii
Nomenclature and notation xxix
Part I Fundamental concepts 1
1 Introduction and overview 1
1.1 Global perspectives 1
1.11 History of quantum computation and quantum
information 2
1.1.2 Future directions 12
1.2 Quantum bits 13
1.2.1 Multiple qubits 16
1.3 Quantum computation 17
1.3.1 Single qubit gates 17
1.3.2 Multiple qubit gates 20
1.3.3 Measurements in bases other than the computational basis 2
1.34 Quantum circuits 2
1.3.5 Qubit copying circuit? 24
Using pdftotext to preserve layout for text output
tesseract images/toc.png images/toc -l eng --psm 11 pdf
pdftotext -layout images/toc.pdf -
Contents
Introduction to the Tenth Anniversary Edition page xvii
Afterword to the Tenth Anniversary Edition xix
Preface xxi
Acknowledgements xx
Nomenclature and notation xxix
Part I Fundamental concepts
1 Introduction and overview
1.1 Global perspectives
1.11 History of quantum computation and quantum
information
1.1.2 Future directions 12
12 Quantum bits 13
1.2.1 Multiple qubits 16
1.3 Quantum computation 17
Single qubit gates
2 Multiple qubit gates 20
Measurements in bases other than the computational basis 2
4 Quantum circu 2
5 Qubit copying circuit? 24