docker/install: build Tesseract from source #197

Merged
bertsky merged 13 commits into OCR-D:master from joschrew:dockerfile-update on Feb 14, 2024

joschrew commented Jan 31, 2024

Contributor

This PR is part of a series to offer single OCR-D modules as Docker containers (OCR-D slim containers) for use with the OCR-D network.

This Dockerfile currently doesn't work in all cases and still needs updates. I created the PR anyway because I need it for my tests. EDIT: It works now. (This basically migrates all the install-tesseract rules from ocrd_all's Makefile here, where they actually belong.)

My idea was to maybe create the tesseract Container with ocrd_all:

cd ocrd_all
git submodule update --init tesserocr/ core/ tesseract/ ocrd_tesserocr/
docker build --build-arg="OCRD_MODULES=core ocrd_tesserocr tesseract tesserocr " --no-cache -t my-ocrd-slim-container .

stweil commented Feb 6, 2024

Contributor

I wonder whether there are still reasons for building the tesseract binary.

Using the package from a recent Linux distribution is simpler and would save significant build time.

Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

bertsky commented Feb 6, 2024

Collaborator

> I wonder whether there are still reasons for building the tesseract binary.
>
> Using the package from a recent Linux distribution is simpler and would save significant build time.

Because most of the time, we cannot use Tesseract from a Linux distribution: our base distro is usually older than the current one, and we have no control over Tesseract features that we actually need. The same goes for PPA.

We had good reasons to pin to a specific Tesseract version via source build in subrepo. No reason to give that up now.

> Another possible approach would also work for tesserocr and some more parts of OCR-D: OCR-D could use its own package repositories for all parts with simple dependencies.

Much simpler: conda
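
To illustrate what pinning Tesseract via a source build can look like, here is a minimal sketch. It is not the set of rules from this PR: the checkout path `tesseract`, the tag `5.3.4` and the install prefix are assumptions picked for the example. The function only prints the commands, so they can be reviewed before anything is built.

```shell
#!/bin/sh
# Sketch only: all three values below are illustrative assumptions.
TESSERACT_SRC="${TESSERACT_SRC:-tesseract}"   # path of the Tesseract source checkout (assumed)
TESSERACT_REF="${TESSERACT_REF:-5.3.4}"       # example tag, not necessarily the one pinned here
PREFIX="${PREFIX:-/usr/local}"                # install prefix (assumed)

# Print the build steps for a pinned Tesseract, one command per line.
build_commands() {
    cat <<EOF
git -C "$TESSERACT_SRC" checkout "$TESSERACT_REF"
cd "$TESSERACT_SRC"
./autogen.sh
./configure --prefix="$PREFIX"
make
make install
EOF
}

build_commands
```

Once the printed steps look right, `build_commands | sh -e` would run them and stop at the first failure.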

joschrew commented Feb 8, 2024

Contributor Author

@kba: Your changes resolved all my errors with my test workspace. I added a resmgr call to the Docker image to add the eng traineddata; without it, I get an error when trying to process.

Edit: Maybe equ.traineddata and osd.traineddata should be added as well; I am not sure.
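
For reference, a resmgr call of this kind could look roughly like the sketch below. It assumes the models are registered for the `ocrd-tesserocr-recognize` executable, and the model list is only the set named in this comment (`eng` as required, `equ` and `osd` as the open question). The function prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: the model list is taken from the comment above and is adjustable.
MODELS="${MODELS:-eng equ osd}"

# Print one `ocrd resmgr download` command per model.
resmgr_commands() {
    for model in $MODELS; do
        echo "ocrd resmgr download ocrd-tesserocr-recognize $model.traineddata"
    done
}

resmgr_commands
```

In an environment where the `ocrd` CLI is installed, `resmgr_commands | sh -e` would perform the downloads.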

bertsky left a comment

Collaborator

Great that this is working now.

Some cosmetic change requests below. Adapting CircleCI config should follow.

4 comment threads on Dockerfile (2 outdated)

bertsky commented Feb 9, 2024

Collaborator

> Adapting CircleCI config should follow.

In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.

So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.
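
The suggested order can be sketched as follows. The make targets are the ones named in this thread; wrapping them in a shell function is only for illustration, and the actual CircleCI config will look different.

```shell
#!/bin/sh
# Sketch only: prints the CI steps in the order proposed above.
ci_steps() {
    cat <<'EOF'
make deps-ubuntu
make install-tesseract install-tesserocr
make install
EOF
}

ci_steps
```

The point of the ordering is that Tesseract and tesserocr are built and installed before the package's own `make install` runs.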

Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>

bertsky commented Feb 12, 2024

Collaborator

> In fact, since it already seems broken on master – unfortunately CircleCI does not keep the logs long enough, but I guess it's about the TESSDATA_PREFIX / resmgr location – we should fix this here.
>
> So I suggest (after rewriting deps-ubuntu as proposed above) to update the CircleCI config to do make install-tesseract install-tesserocr before make install.

Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).

bertsky commented Feb 13, 2024

Collaborator

> Now the CI config definitely needs make install-tesseract install-tesserocr. Also, we must drop the chmod workaround (for which there is no need anymore).

@joschrew do you want me to make that change (on your fork's writable branch)?

make deps-ubuntu no longer fetches Tesseract via PPA, so we need to make install-tesseract

also, drop unsupported Python 3.6
(since normal Circleci `checkout` creates empty submodule directories)
using VIRTUAL_ENV from PYENV_ROOT
bertsky self-requested a review February 14, 2024 10:53
bertsky marked this pull request as ready for review February 14, 2024 10:53

bertsky left a comment

Collaborator

At last!

bertsky requested a review from kba February 14, 2024 10:54
bertsky changed the title from "Update dockerfile" to "docker/install: build Tesseract from source" Feb 14, 2024

bertsky commented Feb 14, 2024

Collaborator

Oh, maybe we should also migrate make install tesseract-training here? (Once we remove these rules from ocrd_all, there would be no more way to compile lstmtraining, combine_tessdata etc.)
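
For context, Tesseract's own autotools build provides `training` and `training-install` targets that compile and install tools such as lstmtraining and combine_tessdata. The sketch below assumes an already configured source tree at the hypothetical path `tesseract`; it is not the rule from ocrd_all, and it only prints the commands.

```shell
#!/bin/sh
# Sketch only: TESSERACT_SRC is an assumed path to a configured Tesseract source tree.
TESSERACT_SRC="${TESSERACT_SRC:-tesseract}"

# Print the commands that build and install the training tools.
training_commands() {
    cat <<EOF
make -C "$TESSERACT_SRC" training
make -C "$TESSERACT_SRC" training-install
EOF
}

training_commands
```

A migrated rule would presumably wrap these two steps after the regular Tesseract build.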

bertsky merged commit bf29777 into OCR-D:master Feb 14, 2024
bertsky mentioned this pull request Feb 22, 2024