FFe: Sync tesseract-* 3.02.01-1 (universe) from Debian sid (main)

Bug #933162 reported by Jeff Breidenbach
This bug affects 1 person
Affects             Status        Importance  Assigned to  Milestone
tesseract (Ubuntu)  Fix Released  Undecided   Unassigned

Bug Description

Please sync tesseract 3.02.01-1 (universe) from Debian sid (main)

Explanation of the Ubuntu delta and why it can be dropped:

This is a major upstream release that incorporated a lot of
improvements and fixes. It runs patch free on Debian.

Explanation of FeatureFreeze exception:

Tesseract is an optical character recognition (OCR) program designed
to turn images into symbolic text. It is developed with support from
Google as part of a book digitization effort, and can have major
impact wherever digitizing paper documents is important. Debian's
version of Tesseract contains improvements to accuracy, and expands
out-of-the-box language support from 6 European languages to 65
languages from all over the world. I am the Debian package
co-maintainer, and have a close relationship with upstream.

There are a number of complications with this synchronization request,
possibly beyond what Ubuntu would normally consider.

  Tesseract 3.0x has only been in Debian for a couple of weeks. It is
  still in Debian Unstable due to package name changes. For example,
  tesseract-ocr-dev became libtesseract-dev.

  Each language is packaged separately, so we are talking 65+
  packages. See the Debian maintainer page for all tesseract-*
  packages. We would like all except for tesseract-ocr-lat-lid to
  enter Ubuntu 12.04.
  http://qa.debian.org/developer.php?<email address hidden>
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659934

  Several packages execute the tesseract binary. They should work fine.
  $ whodepends tesseract-ocr
  gscan2pdf
  ocrfeeder
  ocrodjvu
  slimrat
  slimrat-nox

  Two packages build-depend on Tesseract
  $ build-rdeps tesseract-ocr-dev
  ocropus
  sikuli

  Ocropus FTBFS due to package name changes and API changes in
  Tesseract. Ocropus upstream requests that the obsolete Ocropus
  package be withdrawn entirely.
  http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=658478

  Sikuli has been updated to work with Tesseract 3.02. This is in
  Debian Unstable; it has not yet migrated to Debian testing.

Changelog entries since current precise version 2.04-2.1ubuntu2:

tesseract (3.02.01-1) unstable; urgency=low

  * New upstream release
  * Upstream fixed a segfault (closes: #658634)
  * Upstream wrote some missing manpages.

 -- Jeff Breidenbach <email address hidden> Tue, 14 Feb 2012 18:30:21 -0800

tesseract (3.02-3) unstable; urgency=low

  * lintian: ancient-standards-version, quilt-build-dep-but-no-series-file
  * lintian: wrong-section-according-to-package-name
  * simplify dependencies and require English (closes: 658099)

 -- Jeff Breidenbach <email address hidden> Sat, 04 Feb 2012 15:27:27 -0800

tesseract (3.02-2) unstable; urgency=medium

  * Deal with file moving to new package name (closes: #658476)
  * Move .so symlink to the dev package (closes: #658472)
  * tesseract 3.0x officially breaks ocropus 0.3.x (closes: #658095)
  * Add dependency to equation "language" at request of upstream
  * Note that 3.0x tesseract-ocr-dev was renamed to libtesseract-dev
  * Bumping urgency to medium due to looming propagation deadlines

 -- Jeff Breidenbach <email address hidden> Fri, 03 Feb 2012 10:10:07 -0800

tesseract (3.02-1) unstable; urgency=low

  * New upstream release
  * 3.0x doesn't have trouble with finding files (closes: #558254)
  * 3.0x now works with TIFF format (closes: #589726)
  * Fix subtlety in dependency versioning (closes: #658099)
  * Fix another subtlety in dependency versioning (closes: #658095)
  * 3.0x deals with 16bpp TIFF (closes: #634232)
  * 3.0x deals with .tiff extension properly (closes: #523907)
  * 3.0x has better overall error handling (closes: #551190)

 -- Jeff Breidenbach <email address hidden> Wed, 01 Feb 2012 17:26:22 -0800

tesseract (3.01-3) unstable; urgency=low

  * Hey we are shipping version 3.x (closes: #599045)
  * Death to .la files (closes: #658102)
  * Temporarily remove osd dependency (closes: #658167)
  * tesseract-ocr-osd dependency now valid (closes: #658167)
  * Better package names for shared libraries (closes: #658097)
  * Tersify descriptions a little bit

 -- Jeff Breidenbach <email address hidden> Tue, 31 Jan 2012 14:22:52 -0800

tesseract (3.01-2) unstable; urgency=low

  * Add dependency on script + orientation detection.

 -- Jeff Breidenbach <email address hidden> Mon, 30 Jan 2012 17:08:47 -0800

tesseract (3.01-1) unstable; urgency=low

  * New upstream release

 -- Jeff Breidenbach <email address hidden> Mon, 30 Jan 2012 09:12:42 -0800
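The Debian bugs closed by the entries above can be collected mechanically. The following is a minimal sketch, not part of the original request: the `closed_bugs` helper and the `SAMPLE` text are hypothetical names used for illustration, and the pattern only handles the single-bug "closes:" form seen in this changelog (with or without the leading '#').

```python
import re

# Header line of a changelog entry, e.g.
# "tesseract (3.02-2) unstable; urgency=medium"
HEADER = re.compile(
    r"^(?P<source>[a-z0-9][a-z0-9+.-]*) \((?P<version>[^)]+)\) "
    r"(?P<dist>[^;]+); urgency=(?P<urgency>\w+)"
)
# Matches "closes: #658476" as well as "closes: 658099".
CLOSES = re.compile(r"closes:\s*#?(\d+)", re.IGNORECASE)


def closed_bugs(changelog_text):
    """Map each package version to the sorted bug numbers it closes."""
    result = {}
    version = None
    for line in changelog_text.splitlines():
        header = HEADER.match(line)
        if header:
            # A new entry starts; later lines belong to this version.
            version = header.group("version")
            result[version] = []
        elif version is not None:
            result[version].extend(int(n) for n in CLOSES.findall(line))
    return {v: sorted(set(bugs)) for v, bugs in result.items()}


# Hypothetical excerpt, copied from two of the entries above.
SAMPLE = """\
tesseract (3.02.01-1) unstable; urgency=low

  * New upstream release
  * Upstream fixed a segfault (closes: #658634)

tesseract (3.02-3) unstable; urgency=low

  * simplify dependencies and require English (closes: 658099)
"""

if __name__ == "__main__":
    for version, bugs in closed_bugs(SAMPLE).items():
        print(version, bugs)
```

Entries that close several bugs in one "closes:" clause would need a broader pattern than the one assumed here.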

Etienne Goyer (etienne-goyer-outlands) wrote : Re: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)

Jeff, it's not yet FF (it's at 21:00 UTC today), so it does not need to be an FFe. At least, not yet!

I have subscribed ubuntu-sponsors in the hope that someone on that team has some time to look at this before the actual freeze.

summary: - FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)
+ Sync tesseract 3.02.01-1 (universe) from Debian sid (main)
Jeff Breidenbach (jeff-jab) wrote :

To be 100% clear, this is really a sync request for tesseract-*

David Eger (david-eger) wrote :

As one of the authors, I'd love to see users get easy (packaged) access to a recent release of Tesseract. 3.02 is a big update with lots of improvements: page layout, multi-language mode, bidirectional OCR, increased accuracy, and many more languages.

Micah Gersten (micahg) wrote :

Taking a look

Changed in tesseract (Ubuntu):
assignee: nobody → Micah Gersten (micahg)
status: New → In Progress
Micah Gersten (micahg) wrote :

Thank you for trying to keep Ubuntu up to date. It seems that the patch from Ubuntu wasn't included in this upstream release: http://launchpadlibrarian.net/75401530/tesseract_2.04-2.1_2.04-2.1ubuntu1.diff.gz Would it be possible to do a merge or would you like me to do this?

Changed in tesseract (Ubuntu):
assignee: Micah Gersten (micahg) → nobody
status: In Progress → Incomplete
Jeff Breidenbach (jeff-jab) wrote :

>Would it be possible to do a merge or would you like me to do this?

The fix has just been merged into upstream subversion. I will also add it to Debian right now, as a maintainer patch. Anything else?

Jeff Breidenbach (jeff-jab) wrote :

Okay, tesseract_3.02.01-2 has just been uploaded to Debian to include this patch.

Iain Lane (laney) wrote :

This certainly needs a freeze exception at this point, so please nobody sponsor until it is granted.

I see on the linked Debian bug that sikuli has reported worse performance with this new series. Is that not a concern or is their usecase no longer as well supported?

Could you please give an explicit list of all packages to be synced?

I must admit to being a bit concerned about the way that ocropus was broken without apparently warning its maintainer too, especially given that there is no replacement available yet.

Jeff Breidenbach (jeff-jab) wrote :

>I see on the linked Debian bug that sikuli has reported worse performance with this new series.
>Is that not a concern or is their usecase no longer as well supported?

Tesseract upstream is in communication with Sikuli upstream. A 10% drop in recognition
performance is considered acceptable by Sikuli upstream. Additionally, future releases
of Sikuli may remove that penalty now that the two upstreams are in communication.
Here is the relevant quote from Sikuli upstream Tsung-Hsiang (Sean) Chang.

  "The main reason we aren't switching to tesseract 3 in an official release is
  that its recognition performance is worse than 2.04 in our dataset. (Not very bad,
  about 10% worse as I recall.) So I think it's fine to wrap the tesseract 3 branch for
  Debian sid."

>Could you please give an explicit list of all packages to be synced?

Appended.

>I must admit to being a bit concerned about the way that ocropus was broken
>without apparently warning its maintainer too, especially given that there is
>no replacement available yet.

This is a reasonable concern. I assume you are referring to
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659597

From an etiquette perspective, I have been in email communication (8 threads
in the last 16 days) with Jeffrey Ratcliffe, who is the first listed maintainer for
both Tesseract and Ocropus. I have been in bug tracking communication (7 bugs)
with Jakub Wilk over the same period. Jakub has been incredibly helpful by filing
those packaging bugs. That said, my strategy was to - with blessing from my
co-maintainer - bring Tesseract 3 into Debian unstable, then find and fix problems
as quickly as possible. I apologize for causing surprise.

However, that leaves the issue of Ocropus. If Ubuntu 12.04 accepts Tesseract 3, it
will lose Ocropus. I respect Ubuntu's decision whichever way it goes. Please
consider the number of users affected on either side, and also Ocropus upstream
Tom Breuel's comments.

   "the version of OCRopus that has been packaged is completely outdated. OCRopus
  is now a set of Python libraries with a little bit of C++ in each. The complete final
  package structure isn't settled yet, but I want different components to be fairly
  independent of each other. Now, during my sabbatical, I've finally had time to actually
  work on it more than just a little on the side. The best thing for Debian probably would
  be to discontinue the current packaging for OCRopus and start over again when the
  new release is out."

Thank you for your consideration.

=========

Full list of source packages to remove:

ocropus
tesseract-ocr-deu-f

Full list of non-source packages to remove (maybe this goes away automatically):

tesseract-ocr-dev

Full list of source packages to sync (note lack of tesseract-lat-lid):

sikuli
tesseract
tesseract-afr
tesseract-ara
tesseract-aze
tesseract-bel
tesseract-ben
tesseract-bul
tesseract-cat
tesseract-ces
tesseract-chi-sim
tesseract-chi-tra
tesseract-chr
tesseract-dan
tesseract-deu
tesseract-deu-frak
tesseract-ell
tesseract-eng
tesseract-enm
tesseract-epo
tesseract-equ
tesseract-est
tesseract-eus
tesseract-fin
tesseract-fra
tesseract-frk
tesse...

Jeff Breidenbach (jeff-jab) wrote :

Trying again with proper formatting.

>I see on the linked Debian bug that sikuli has reported worse
>performance with this new series. Is that not a concern or is their
>usecase no longer as well supported?

Tesseract upstream is in communication with Sikuli upstream. A 10%
drop in recognition performance is considered acceptable by Sikuli
upstream. Additionally, future releases of Sikuli may remove that
penalty now that the two upstreams are in communication. Here is the
relevant quote from Sikuli upstream Tsung-Hsiang (Sean) Chang.

  "The main reason we aren't switching to tesseract 3 in an
  official release is that its recognition performance is worse than
  2.04 in our dataset. (Not very bad, about 10% worse as I recall.) So
  I think it's fine to wrap the tesseract 3 branch for Debian sid."

>Could you please give an explicit list of all packages to be synced?

Appended.

>I must admit to being a bit concerned about the way that ocropus was
>broken without apparently warning its maintainer too, especially
>given that there is no replacement available yet.

This is a reasonable concern. I assume you are referring to
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=659597

From an etiquette perspective, I have been in email communication (8
threads in the last 16 days) with Jeffrey Ratcliffe, who is the first
listed maintainer for both Tesseract and Ocropus. I have been in bug
tracking communication (7 bugs) with Jakub Wilk over the same
period. Jakub has been incredibly helpful by filing those packaging
bugs. That said, my strategy was to - with blessing from my
co-maintainer - bring Tesseract 3 into Debian unstable, then find and
fix problems as quickly as possible. I apologize for causing surprise.

However, that leaves the issue of Ocropus. If Ubuntu 12.04 accepts
Tesseract 3, it will lose Ocropus. I respect Ubuntu's decision
whichever way it goes. Please consider the number of users affected on
either side, and also Ocropus upstream Tom Breuel's comments.

   "the version of OCRopus that has been packaged is completely
  outdated. OCRopus is now a set of Python libraries with a little bit
  of C++ in each. The complete final package structure isn't settled
  yet, but I want different components to be fairly independent of
  each other. Now, during my sabbatical, I've finally had time to
  actually work on it more than just a little on the side. The best
  thing for Debian probably would be to discontinue the current
  packaging for OCRopus and start over again when the new release is
  out."

Thank you for your consideration.

summary: - Sync tesseract 3.02.01-1 (universe) from Debian sid (main)
+ FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)
Iain Lane (laney) wrote : Re: FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)

I would probably prefer to see this happen in 12.10, in the hope that the problems with sikuli and ocropus are resolved in the interim.

But I won't block you from proceeding in 12.04 if you want to, so the FFe is approved by me.

I'm leaving as New and subscribing the sponsors for you. Comment #9 shows how to proceed. I'd appreciate a courtesy check that an archive administrator doesn't mind performing the work that this will cause, and please make sure to watch all changed packages for bug reports.

Please have this done by Beta 2 freeze, March 22.

Changed in tesseract (Ubuntu):
status: Incomplete → New
Scott Kitterman (kitterman) wrote :

I can't do the removal, but removing one package shouldn't be an issue. I don't mind doing the New processing for syncs from Debian.

Is the ocropus we have now working? If not it seems this should go ahead.

Jeff Breidenbach (jeff-jab) wrote :

Good news, the new tesseract has now entered Debian testing.

>I would probably prefer to see this happen in 12.10, in the hope that
>the problems with sikuli and ocropus are resolved in the interim.

Sikuli is good to go. See version 1.0~x~rc3.tesseract3-dfsg1-1 in Debian
Unstable. It is still waiting to enter Debian testing.

Ocropus is broken in Debian, it can neither build nor run. This will
not change until upstream completes a major rewrite.

>Is the ocropus we have now working? If not it seems this should go ahead.

If I am reading the bug reports correctly, the current Ubuntu ocropus
is working, but is obsolete as per Debian and Ubuntu bug reports and
upstream commentary.

This really does look like a (new) Tesseract vs (old) Ocropus
tradeoff. There is a stack of people peering over my shoulder
muttering about why the former is more important; let me know if
you would like to hear from them.

https://bugs.launchpad.net/ubuntu/+source/ocropus/+bug/500527

Jeff Breidenbach (jeff-jab) wrote :

(When I said Sikuli is good to go, I meant that it builds, runs, and upstream is happy. There has not been additional performance tuning.)

Scott Kitterman (kitterman) wrote : Re: [Bug 933162] Re: FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)

Looking at popcon, I think tesseract is used ~an order of magnitude more than
ocropus.

Jeff Breidenbach (jeff-jab) wrote : Re: FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)

>I'm leaving as New and subscribing the sponsors for you. Comment #9
>shows how to proceed. I'd appreciate a courtesy check that an archive
>administrator doesn't mind performing the work that this will cause, and
>please make sure to watch all changed packages for bug reports.

I'm not sure who "you" is in this context. If "you" means Jeff, I am
a Debian Developer but otherwise not affiliated with Ubuntu
development. Happy to watch for bug reports but I don't know how
to talk to archive administrators.

Bryce Harrington (bryce)
summary: - FFe: Sync tesseract 3.02.01-1 (universe) from Debian sid (main)
+ FFe: Sync tesseract-* 3.02.01-1 (universe) from Debian sid (main)
Daniel Holbach (dholbach) wrote :
Stefano Rivera (stefanor) wrote :

Syncing these packages.

Stefano Rivera (stefanor) wrote :

All syncs (tesseract, tesseract-*, sikuli) filed.

ocropus removal requested in LP: #941136

Changed in tesseract (Ubuntu):
status: New → Fix Released
Jeff Breidenbach (jeff-jab) wrote :

Stefano, this is beyond fantastic. Just checking - does the sync include these removals, as mentioned in comment #9?

RM: tesseract-ocr-deu-f -- ROM package renamed to tesseract-ocr-deu-frak
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660677

RM: tesseract-ocr-dev -- ROM replaced by libtesseract-dev
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=660680

Scott Kitterman (kitterman) wrote : Re: [Bug 933162] Re: FFe: Sync tesseract-* 3.02.01-1 (universe) from Debian sid (main)

> does the sync include these removals, as mentioned in comment #9?

The removals are the subject of a separate process that only a few people can do.
They'll get done, but usually take longer.

Stefano Rivera (stefanor) wrote :

Ah, I missed tesseract-ocr-deu-f, thanks. Removal requested in LP: #941292.
