Discussion:
[XeTeX] potential new feature: \XeTeXgenerateactualtext
Jonathan Kew
2016-02-23 14:43:35 UTC
Permalink
The code for the \XeTeXgenerateactualtext feature (it's an integer
parameter; set it to 1 to get ActualText added to the PDF, for better
copy/paste and search in Acrobat) is now on sourceforge, in an
"actualtext" branch, for anyone who wants to try building and
experimenting with it.

Note that this requires a new version of xdvipdfmx, as it uses a new DVI
opcode. The patch for xdvipdfmx is attached here (based on the current
TeXLive svn source).

Akira, if you could check that the patch seems OK, that would be great.
I've not really looked at dvipdfm-x code in a long time. I haven't
pushed this to TL yet, as it's all rather experimental, but I hope we
can safely include it for TL'16.

JK
Adam Twardoch (List)
2016-02-23 14:52:34 UTC
Permalink
Jonathan,

this is splendid. Adding support for the PDF "ActualText" tagging layer is a huge step.

I wonder — what happens in case of mathematical formulae?

I think it would be rather clever to embed the TeX notation or even, huh huh, MathML into the ActualText layer for the math mode — per equation, of course :) . Or use the "Unicode math linear format" as proposed by Microsoft:
http://www.unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf

A.

Sent from my mobile phone.
Jonathan Kew
2016-02-23 15:00:21 UTC
Permalink
Post by Adam Twardoch (List)
Jonathan,
this is splendid. Adding support for the PDF "ActualText" tagging layer is a huge step.
I wonder — what happens in case of mathematical formulae?
At this point, nothing in particular. :)
Post by Adam Twardoch (List)
I think it would be rather clever to embed the TeX notation or even, huh
huh, MathML into the ActualText layer for the math mode — per equation,
of course :) .
I think these are ideas that could usefully be explored/implemented at
the macro level, rather than being built in to the engine.

JK

--------------------------------------------------
Subscriptions, Archive, and List information, etc.:
http://tug.org/mailman/listinfo/xetex
Adam Twardoch (List)
2016-02-23 15:29:36 UTC
Permalink
Jonathan,

is there any method in XeTeX to explicitly emit "ActualText" or override the automatic content generated by the new option?

Or could you envision such a method? How would one need to approach it?

(I'm not saying you should try to implement it right away). :)

A.

Sent from my mobile phone.
Jonathan Kew
2016-02-23 15:50:33 UTC
Permalink
Post by Adam Twardoch (List)
Jonathan,
is there any method in XeTeX to explicitly emit "ActualText" or override the automatic content generated by the new option?
Not currently. What you get is the Unicode text of each "word"
(consecutive run of non-space characters in a given font).
Post by Adam Twardoch (List)
Or could you envision such a method? How would one need to approach it?
(I'm not saying you should try implement it right away). :)
For a document that wants some other kind of "ActualText", there's going
to need to be pretty detailed markup in the source, I think. (E.g. each
word, or similar unit, will need to be tagged to provide the desired
ActualText that goes with it.) At that point, I wonder if turning off
\XeTeXgenerateactualtext and just doing it "manually" with macros that
generate \special{}s would be the most reasonable way forward.

I suppose it's possible you might want automatic ActualText for most of
the content, but custom overrides for certain fragments. At this point,
there's no support for that -- \XeTeXgenerateactualtext is a switch that
takes effect at \shipout time, so in effect it is "global" for all the
content on a page -- but perhaps we could make it scoped, so that you
could toggle it on/off at will within the text.

That probably wouldn't be hard to do; I'll give it a bit more thought.

JK
Will Robertson
2016-02-24 23:31:04 UTC
Permalink
Post by Jonathan Kew
For a document that wants some other kind of "ActualText", there's going to need to be pretty detailed markup in the source, I think. (E.g. each word, or similar unit, will need to be tagged to provide the desired ActualText that goes with it.) At that point, I wonder if turning off \XeTeXgenerateactualtext and just doing it "manually" with macros that generate \special{}s would be the most reasonable way forward.
This sounds interesting for maths, where there is a chance we could automatically insert \special{}s at the glyph and/or the equation level — has this always been possible in XeTeX or does this require the newest patch for xdvipdfmx you just released?

Cheers,
Will





Ross Moore
2016-02-25 01:05:45 UTC
Permalink
Hi Will, Jonathan, and others
Post by Jonathan Kew
For a document that wants some other kind of "ActualText", there's going to need to be pretty detailed markup in the source, I think. (E.g. each word, or similar unit, will need to be tagged to provide the desired ActualText that goes with it.) At that point, I wonder if turning off \XeTeXgenerateactualtext and just doing it "manually" with macros that generate \special{}s would be the most reasonable way forward.
You have to be *very* careful with /ActualText, since it must be done using PDFdoc encoding,
as it becomes part of the page contents stream.
Any errors will corrupt the PDF file completely — but that’s true of other things as well.
Heiko’s \pdfstringdef in the hyperref package is very good for handling this...
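As a small illustration of the care involved, here is a minimal Python sketch of writing a PDF literal string safely. It is not what \pdfstringdef does internally. The helper name `pdf_literal_string` is made up for this example, and it assumes the text is restricted to printable ASCII, the range where PDFDocEncoding and ASCII coincide.

```python
def pdf_literal_string(text: str) -> str:
    """Return text as a PDF literal string, e.g. for use after /ActualText.

    Hypothetical helper: accepts printable ASCII only (an assumption made
    here to stay inside the part of PDFDocEncoding that matches ASCII).
    """
    escaped = []
    for ch in text:
        if not (0x20 <= ord(ch) <= 0x7E):
            raise ValueError(f"character {ch!r} needs a different encoding")
        if ch in "()\\":
            # An unescaped backslash or an unbalanced parenthesis would end
            # the string early or swallow the operators that follow it.
            escaped.append("\\" + ch)
        else:
            escaped.append(ch)
    return "(" + "".join(escaped) + ")"


print(pdf_literal_string("f(x) = a\\b"))
```

Anything outside that ASCII range is rejected rather than guessed at, since a wrongly encoded byte in the page contents stream is exactly the kind of error that can break the file.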
Post by Will Robertson
This sounds interesting for maths, where there is a chance we could automatically insert \special{}s at the glyph and/or the equation level — has this always been possible in XeTeX or does this require the newest patch for xdvipdfmx you just released?
… but doing the math-characters correctly, without interfering with spacings,
is highly non-trivial.

Look at some of my papers associated with TUG conferences, to see various
options that can be used to make mathematics more accessible in PDFs; i.e.,
papers numbered as 5, 6, 7 on this page:

http://www.tug.org/twg/accessibility/

Although these were done using pdfTeX, some of these things should be able
to be implemented for XeTeX + xdvipdfmx also.
Cheers,

Ross




Will Robertson
2016-02-25 06:19:57 UTC
Permalink
Hi Ross,

Great to hear from you.
I thought of you straight away when writing my email :)
Post by Ross Moore
You have to be *very* careful with /ActualText, since it must be done using PDFdoc encoding,
as it becomes part of the page contents stream.
Any errors will corrupt the PDF file completely — but that’s true of other things as well.
Heiko’s \pdfstringdef in the hyperref package is very good for handling this…
That’s good to know, thanks.
I think there has been *some* work by one or two of the LaTeX3 members on general methods for this sort of thing, but it’s been a while.
Post by Ross Moore
Post by Will Robertson
This sounds interesting for maths, where there is a chance we could automatically insert \special{}s at the glyph and/or the equation level — has this always been possible in XeTeX or does this require the newest patch for xdvipdfmx you just released?
… but doing the math-characters correctly, without interfering with spacings,
is highly non-trivial.
I have no doubt!!
Post by Ross Moore
Look at some of my papers associated with TUG conferences, to see various
options that can be used to make mathematics more accessible in PDFs; i.e.,
http://www.tug.org/twg/accessibility/
Although these were done using pdfTeX, some of these things should be able
to be implemented for XeTeX + xdvipdfmx also.
This is exactly where I was going with all this (so we’re getting quite far away from the new primitive).
My understanding is that the extended pdfTeX you were using was included in TeX Live 2015, is that right? Or will be in TL2016?

How much work would it be to translate that work into something that will also function in XeTeX?

Cheers,
Will





Ross Moore
2016-02-25 06:39:32 UTC
Permalink
Hi Will,
Post by Will Robertson
Hi Ross,
Great to hear from you.
I thought of you straight away when writing my email :)
Post by Ross Moore
You have to be *very* careful with /ActualText, since it must be done using PDFdoc encoding,
as it becomes part of the page contents stream.
Any errors will corrupt the PDF file completely — but that’s true of other things as well.
Heiko’s \pdfstringdef in the hyperref package is very good for handling this…
That’s good to know, thanks.
I think there has been *some* work by one or two of the LaTeX3 members on general methods for this sort of thing, but it’s been a while.
Send me their names.
I may have a bit more time this year.
Post by Will Robertson
Post by Ross Moore
Look at some of my papers associated with TUG conferences, to see various
options that can be used to make mathematics more accessible in PDFs; i.e.,
http://www.tug.org/twg/accessibility/
Although these were done using pdfTeX, some of these things should be able
to be implemented for XeTeX + xdvipdfmx also.
This is exactly where I was going with all this (so we’re getting quite far away from the new primitive).
My understanding is that the extended pdfTeX you were using was included in TeX Live 2015, is that right? Or will be in TL2016?
The later papers, which are not directly on “Tagged PDF”, don’t require
the special tagging features.
Post by Will Robertson
How much work would it be to translate that work into something that will also function in XeTeX?
That depends on how easy it is to create PDF objects and object references
between them.
I don’t know how xdvipdfmx does it. If it uses pdfmark, as dvips does,
then it’s nowhere near as convenient as with pdfTeX.

Hopefully someone with the necessary experience can pick up on those ideas.
That’s why I’ve followed up your comment on this list.
Indeed, we need someone to get pdfx.sty working with XeLaTeX;
it’s for similar reasons that it doesn’t do so already.

Switch it to another thread, if you think that is appropriate.
Cheers,

Ross






Zdenek Wagner
2016-02-25 09:17:44 UTC
Permalink
Post by Ross Moore
Hi Will, Jonathan, and others
Post by Jonathan Kew
For a document that wants some other kind of "ActualText", there's
going to need to be pretty detailed markup in the source, I think. (E.g.
each word, or similar unit, will need to be tagged to provide the desired
ActualText that goes with it.) At that point, I wonder if turning off
\XeTeXgenerateactualtext and just doing it "manually" with macros that
generate \special{}s would be the most reasonable way forward.
You have to be *very* careful with /ActualText, since it must be done
using PDFdoc encoding,
as it becomes part of the page contents stream.
I thought so a few years ago but /ActualText may be done in Unicode if the
string is prepended with BOM.
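A minimal Python sketch of that second option, assuming the standard PDF conventions: a text string may be UTF-16BE with a leading FE FF byte order mark, and /ActualText is attached through a /Span marked-content sequence. The function names and the `(x) Tj` placeholder operators are made up for illustration. This is not the code XeTeX or xdvipdfmx uses.

```python
def actualtext_hex(text: str) -> str:
    """Encode text as a PDF hex string in UTF-16BE with a leading BOM."""
    data = b"\xfe\xff" + text.encode("utf-16-be")
    return "<" + data.hex().upper() + ">"


def span_with_actualtext(text: str, content_ops: str) -> str:
    """Wrap content-stream operators in a /Span carrying /ActualText.

    content_ops is a placeholder for the glyph-showing operators.
    """
    return (
        f"/Span << /ActualText {actualtext_hex(text)} >> BDC\n"
        f"{content_ops}\n"
        "EMC"
    )


print(actualtext_hex("日本"))
print(span_with_actualtext("x", "(x) Tj"))
```

Writing the string in hex form sidesteps the escaping of parentheses and backslashes that a literal string would need.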

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz

Jonathan Kew
2016-02-25 10:03:23 UTC
Permalink
Post by Will Robertson
Post by Jonathan Kew
For a document that wants some other kind of "ActualText", there's
going to need to be pretty detailed markup in the source, I think.
(E.g. each word, or similar unit, will need to be tagged to provide
the desired ActualText that goes with it.) At that point, I wonder
if turning off \XeTeXgenerateactualtext and just doing it
"manually" with macros that generate \special{}s would be the most
reasonable way forward.
This sounds interesting for maths, where there is a chance we could
automatically insert \special{}s at the glyph and/or the equation
level — has this always been possible in XeTeX or does this require
the newest patch for xdvipdfmx you just released?
The xdvipdfmx patch does not have any effect on \special{} handling; the
implementation of \XeTeXgenerateactualtext doesn't put traditional
"special"s in the output, it uses a new DVI opcode to provide the
text+glyphs for each word.

I'd guess it has always been possible, in principle, to attach
ActualText to math at the macro level, using \special{}s to write the
necessary PDF code directly. But I confess I haven't really looked into
what this would involve.... perhaps there are obstacles that make it
impractical.

JK


Will Robertson
2016-02-25 10:47:39 UTC
Permalink
Post by Jonathan Kew
I'd guess it has always been possible, in principle, to attach ActualText to math at the macro level, using \special{}s to write the necessary PDF code directly. But I confess I haven't really looked into what this would involve.... perhaps there are obstacles that make it impractical.
Interesting… thanks for the confirmation, Jonathan.

Best regards,
Will




Akira Kakuto
2016-02-23 15:46:55 UTC
Permalink
Hi Jonathan,
Post by Jonathan Kew
Akira, if you could check that the patch seems OK, that would be great.
I've not really looked at dvipdfm-x code in a long time. I haven't
pushed this to TL yet, as it's all rather experimental, but I hope we
can safely include it for TL'16.
Thanks very much. I think it is OK, so I installed it as r39835.

Thanks,
Akira



Akira Kakuto
2016-02-23 16:46:02 UTC
Permalink
Hi Jonathan,
Post by Jonathan Kew
The code for the \XeTeXgenerateactualtext feature (it's an integer
parameter; set it to 1 to get ActualText added to the PDF, for better
copy/paste and search in Acrobat) is now on sourceforge, in an
"actualtext" branch, for anyone who wants to try building and
experimenting with it.
In the case where \XeTeXgenerateactualtext=1, I cannot
select the full text in Adobe Acrobat XI Pro on Windows;
I can select only about half of it.
Does it depend on the reader?
The copied text is OK, that is, it is the full text.
See and test with the attached actest.zip.

Thanks,
Akira
Zdenek Wagner
2016-02-23 16:50:42 UTC
Permalink
Hi Akira,

I have a similar problem in Linux. As Jonathan wrote, highlighting is quite
weird but the result is OK.

Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
Jonathan Kew
2016-02-23 17:11:16 UTC
Permalink
Post by Zdenek Wagner
Hi Akira,
I have a similar problem in Linux. As Jonathan wrote, highlighting is
quite weird but the result is OK.
Yes, that's what I meant about highlighting not working very well. I
don't know yet whether there's something we can do when generating the
PDF to make it work better, or if this is simply a bug in Acrobat's
display.
Philip Taylor
2016-02-23 17:39:06 UTC
Permalink
Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1 allows
me to select only half of the text whereas Adobe Reader DC allows me to
select it all; neither allows me to select individual kanji.

** Phil.


Jonathan Kew
2016-02-23 18:00:17 UTC
Permalink
Post by Philip Taylor
Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1 allows
me to select only half of the text whereas Adobe Reader DC allows me to
select it all; neither allows me to select individual kanji.
Ah, right... as there are no spaces between the kanji, they'll end up in
the same text object. That's a shortcoming of how the current
implementation works, for scripts that don't use inter-word spaces.
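The "word" rule described earlier in the thread (a consecutive run of non-space characters in a given font) can be modelled with a short Python sketch. This is only a model of the described behaviour, not XeTeX's actual code. The helper names and the font names "Body" and "Mincho" are made up for the example.

```python
from itertools import groupby


def tag(text, font):
    """Pair every character of text with a (hypothetical) font name."""
    return [(ch, font) for ch in text]


def actualtext_units(chars):
    """Group (character, font) pairs into runs of non-space characters
    that share one font; each run stands for one ActualText unit."""
    units = []
    # The key is None for spaces and the font name otherwise, so a run
    # ends at a space or at a change of font.
    for font, run in groupby(
        chars, key=lambda cf: None if cf[0].isspace() else cf[1]
    ):
        if font is not None:
            units.append("".join(ch for ch, _ in run))
    return units


print(actualtext_units(tag("copy and paste", "Body")))
print(actualtext_units(tag("日本国憲法日本国憲法", "Mincho")))
```

With spaced text each word becomes its own unit, while the unspaced kanji string comes out as a single unit, which matches the observation that individual kanji cannot be selected.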

In either case, copy&paste actually gives you the whole text, even
though AAPro only highlights half of it, I guess?

JK



Philip Taylor
2016-02-23 18:18:59 UTC
Permalink
Post by Jonathan Kew
In either case, copy&paste actually gives you the whole text, even
though AAPro only highlights half of it, I guess?
Yes, six (consecutive) instances of 日本国憲法
** Phil.


Andrew Cunningham
2016-02-23 21:40:04 UTC
Permalink
Is it copying actualtext or the text layer?

A better test would be one of the complex scripts.

Andrew
ShreeDevi Kumar
2016-02-24 09:22:09 UTC
Permalink
Testing dev-actualtext.pdf sent by JK

- Adobe Acrobat Reader XI on Windows 10
  - Does not highlight text fully
  - SEARCH finds words and word parts correctly but usually highlights
    only beginning of the word containing the letter
  - COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly
  - Save as TXT file does not work correctly - only saves ... in it,
    not the actual unicode text which can be copied
- Foxit Reader 7.3 on Windows 10
  - Highlights text fully
  - smallest highlight unit is word
  - COPY paste to notepad++ as well as SEARCH does NOT work correctly as
    Unicode text is not fully correct.

    ूय
    िनकोड क्या ह ? ै

  - Save as TXT file does not work correctly - saves the unicode text
    with same problems as in copy and paste
- Microsoft Edge Viewer on Windows 10
  - Highlights text fully
  - COPY paste to notepad++ as well as SEARCH does NOT work correctly
    as Unicode text is not fully correct.

    य ूिनकोड क्या है?

- Previewing from within gmail in Chrome on Windows 10
  - Highlights text fully
  - smallest highlight unit is word
  - COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly
  - (highlights only first letter of first word in paragraph यू rather
    than full word यूनिकोड)
  - there is NO SEARCH feature
  - there is no save as TXT file feature
- Same as above while Previewing from within gmail in Internet Explorer
  on Windows 10




ShreeDevi
Jonathan Kew
2016-02-24 10:06:42 UTC
Permalink
Post by ShreeDevi Kumar
Testing dev-actualtext.pdf sent by JK
* Adobe Acrobat Reader XI on Windows 10
o Does not highlight text fully
o SEARCH finds words and word parts correctly but usually
highlights only beginning of the word containing the letter
o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
o Save as TXT file does not work correctly - only saves ... in it,
not the actual unicode text which can be copied
So it looks like Acrobat makes use of the ActualText for Search and
Copy, but sadly its "Save as Text" doesn't support Unicode.

I'm pleasantly surprised to see the Gmail previewer also handles it.

The others (Foxit, Edge) sound like they're just working from the glyph
stream, which is basically doomed to failure.

For a further data point, I tried Evince (Document Viewer) on Ubuntu
15.10, and found that Copy and Search work well; it looks like it is
using the ActualText correctly. This is thanks to the poppler library, I
believe. The (poppler-based) "pdftotext" tool was also able to extract
the Unicode text correctly from the PDF, although "pdftohtml" didn't do
so well.

One issue with Evince is that drag-selecting text to highlight it (as
for Copy/Paste) looks bad: the highlighting completely obscures the
selected text, although it will end up being copied correctly.
Interestingly, its highlighting of search results doesn't suffer from
this problem, and it even makes a fair attempt (not completely accurate)
at highlighting specific letters within a word, not just entire words.

JK
ShreeDevi Kumar
2016-02-25 05:11:27 UTC
Permalink
Jonathan,

This is a really useful feature and I look forward to using it once it is
released in TL2016.

Since how well the search and copy/paste features work could also be font
dependent, I would like to test some more PDFs in Unicode Devanagari
created by this new feature using other fonts. I usually use the Siddhanta
and Sanskrit2003 fonts.

I would appreciate it if you or other members who have this feature installed
could provide a few more sample PDFs in Devanagari for testing.

Thanks!

- sent from my phone. excuse the brevity.
Post by ShreeDevi Kumar
Testing dev-actualtext.pdf sent by JK
* Adobe Acrobat Reader XI on Windows 10
o Does not highlight text fully
o SEARCH finds words and word parts correctly but usually
highlights only beginning of the word containing the letter
o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
o Save as TXT file does not work correctly - only saves ... in it,
not the actual unicode text which can be copied
So it looks like Acrobat makes use of the ActualText for Search and Copy,
but sadly its "Save as Text" doesn't support Unicode.
I'm pleasantly surprised to see the Gmail previewer also handles it.
The others (Foxit, Edge) sound like they're just working from the glyph
stream, which is basically doomed to failure.
For a further data point, I tried Evince (Document Viewer) on Ubuntu
15.10, and found that Copy and Search work well; it looks like it is using
the ActualText correctly. This is thanks to the poppler library, I believe.
The (poppler-based) "pdftotext" tool was also able to extract the Unicode
text correctly from the PDF, although "pdftohtml" didn't do so well.
One issue with Evince is that drag-selecting text to highlight it (as for
Copy/Paste) looks bad: the highlighting completely obscures the selected
text, although it will end up being copied correctly. Interestingly, its
highlighting of search results doesn't suffer from this problem, and it
even makes a fair attempt (not completely accurate) at highlighting
specific letters within a word, not just entire words.
JK
* Foxit Reader 7.3 on Windows 10
o Highlights text fully,
o smallest highlight unit is word,
o COPY paste to notepad++ as well as SEARCH does NOT work
correctly as Unicode text is not fully correct.
ूय
िनकोड क्या ह ? ै
o Save as TXT file does not work correctly - saves the unicode
text with same problems as in copy and paste
* Microsoft Edge Viewer on Windows 10
o Highlights text fully,
o COPY paste to notepad++ as well as SEARCH does NOT work
correctly as Unicode text is not fully correct.
य ूिनकोड क्या है?
* Previewing from within gmail in Chrome on Windows 10 -
o Highlights text fully,
o smallest highlight unit is word,
o COPY paste to NOTEPAD++, OPENOFFICE WRITER works correctly,
o (highlights only first letter of first word in
paragraph यू rather than full word यूनिकोड)
o there is NO SEARCH feature
o there is no save as TXT file feature
* Same as above while Previewing from within gmail in Internet
Explorer on Windows 10
ShreeDevi
____________________________________________________________
Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1 allows
me to select only half of the text whereas Adobe Reader DC allows me to
select it all; neither allows me to select individual kanji.
Ah, right... as there are no spaces between the kanji, they'll end
up in the same text object. That's a shortcoming of how the current
implementation works, for scripts that don't use inter-word spaces.
In either case, copy&paste actually gives you the whole text, even
though AAPro only highlights half of it, I guess?
JK
Jonathan Kew
2016-02-27 16:24:35 UTC
Permalink
Post by ShreeDevi Kumar
Jonathan,
This is a really useful feature and I look forward to using it once it
is released in TL 2016.
Since how well the search and copy/paste features work could also be
font dependent, I would like to test some more PDFs in Unicode
Devanagari created by this new feature using other fonts. I usually use
the Siddhanta and Sanskrit2003 fonts.
Search and copy/paste behave in highly font-dependent ways for
applications that do NOT support ActualText (or for PDFs that do not
include it), but for applications that support the ActualText feature,
this should be independent of the font being used.

If you'd like to try it with the Siddhanta and Sanskrit2003 (or other)
fonts using Akira's builds, and confirm whether it works for you, that
would be great. I'd like to merge the code to the xetex master branch in
the next day or two, if no-one reports problems with it.

(To solve the "xelatex.fmt doesn't match xetex.pool" problem, rebuild
the xetex formats with fmtutil, as Zdeněk indicated. Or just delete the
old xelatex.fmt file, and I expect it should be re-created as needed on
the next run.)

JK
Post by ShreeDevi Kumar
I would appreciate it if you or other members who have this feature
installed could provide a few more sample PDFs in Devanagari for testing.
Thanks!
- sent from my phone. excuse the brevity.
Apostolos Syropoulos
2016-02-27 16:40:51 UTC
Permalink
Post by Jonathan Kew
Search and copy/paste behave in highly font-dependent ways for
applications that do NOT support ActualText (or for PDFs that do not
include it), but for applications that support the ActualText feature,
this should be independent of the font being used.
Once I was trying to find a word in a PDF file. The first letters of the word
were Th but I could not find any word starting with a Th. The reason?
The Th in the PDF was actually a ligature... I am sure that this ActualText
feature will solve such problems.


Thank you!


A.S.

----------------------
Apostolos Syropoulos
Xanthi, Greece
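
As a rough illustration of the ligature case described above, here is a minimal sketch of a XeLaTeX test file. It is not taken from the thread. It assumes a XeTeX build that provides \XeTeXgenerateactualtext (such as the 0.99995 test build mentioned in this discussion) and an installed font that forms a Th ligature; "Linux Libertine O" is only an assumed example, so substitute any suitable font.

```latex
% Minimal sketch, not from the thread: test search/copy of a ligature.
% Assumptions: a XeTeX build with \XeTeXgenerateactualtext, and an
% installed font that forms a Th ligature (the font name is an example).
\documentclass{article}
\usepackage{fontspec}
\setmainfont{Linux Libertine O}
% Integer parameter: 1 asks XeTeX to add ActualText to the PDF.
\XeTeXgenerateactualtext=1
\begin{document}
The Thames flows past the office.
\end{document}
```

Searching the resulting PDF for "Th" should then succeed in viewers that honour ActualText; viewers that read only the glyph stream may still miss the word.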


Philip Taylor
2016-02-27 17:04:46 UTC
Permalink
Post by Jonathan Kew
If you'd like to try it with the Siddhanta and Sanskrit2003 (or other)
fonts using Akira's builds, and confirm whether it works for you, that
would be great. I'd like to merge the code to the xetex master branch in
the next day or two, if no-one reports problems with it.
No problems encountered here, Jonathan, when using Akira-san's build on
my current project.

** Phil.




ShreeDevi Kumar
2016-02-27 17:49:16 UTC
Permalink
Thank you for clarifying that this feature is independent of the font being
used.

I was able to use it with the Siddhanta and Sanskrit2003 fonts using
Akira's builds, and it works similarly to the earlier PDF.


ShreeDevi
ShreeDevi Kumar
2016-02-28 02:41:49 UTC
Permalink
Some more feedback regarding devanagari pdf with Actualtext
Post by Jonathan Kew
Post by Adam Twardoch (List)
On a Mac this worked with the PDF Expert app, but not with the default
Finder app (the latter works with simple text such as “atha”, but not when a
letter is combined with any other vowel or consonant, such as “ruu” or “sva”).

......

I am hoping that once the \XeTeXgenerateactualtext feature is implemented,
it will lead to more PDF apps supporting ActualText.

- sent from my phone. excuse the brevity.
Akira Kakuto
2016-02-29 02:10:42 UTC
Permalink
Post by Philip Taylor
Using Akira-san's "actest.pdf" as sample, Adobe Acrobat Pro 7.1 allows
me to select only half of the text whereas Adobe Reader DC allows me to
select it all; neither allows me to select individual kanji.
I have found that SumatraPDF for Windows can select, copy and paste
even an individual kanji in XeTeX 0.99995 with \XeTeXgenerateactualtext=1.
SumatraPDF is better than Adobe Acrobat XI Pro with respect to /ActualText.

Best,
Akira



ShreeDevi Kumar
2016-02-29 06:59:13 UTC
Permalink
SumatraPDF on Windows 10 behaved like Foxit Reader with Devanagari text:
it uses the glyph stream instead of the ActualText.

ShreeDevi
Akira Kakuto
2016-02-29 07:58:16 UTC
Permalink
Dear ShreeDevi Kumar,
SumatraPDF on windows 10 was similar to Foxit Reader - using the glyph stream
instead of the actualtext - in case of devanagari text.
Thanks a lot for correcting my mistake.

Best,
Akira



Akira Kakuto
2016-02-29 03:18:38 UTC
Permalink
Post by Jonathan Kew
Ah, right... as there are no spaces between the kanji, they'll end up in
the same text object.
If I set,

\XeTeXlinebreaklocale="ja-JP"
\XeTeXlinebreakskip=0pt plus 1pt minus 0pt

even an individual kanji can be 'selected', 'copied' and 'pasted',
even in the case of Adobe Acrobat XI Pro.

Best,
Akira
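
The settings above can be combined with \XeTeXgenerateactualtext in a small XeLaTeX test file. This is only a sketch, not the source of "actest.pdf": the font name IPAexMincho is an assumption (any installed Japanese font should do), and the locale is written in the space-separated form used in the XeTeX reference.

```latex
% Minimal sketch, not the actual "actest" source from the thread.
% Assumption: the IPAexMincho font is installed; replace as needed.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{IPAexMincho}
% Integer parameter: 1 asks XeTeX to add ActualText to the PDF.
\XeTeXgenerateactualtext=1
% Allow line breaks between kanji, with a little stretch at each break.
\XeTeXlinebreaklocale "ja-JP"
\XeTeXlinebreakskip=0pt plus 1pt minus 0pt
\begin{document}
日本語の文章を一文字ずつ選択できますか。
\end{document}
```

With these settings each break opportunity presumably ends a text object, which would explain why single kanji become selectable; that reading follows from Jonathan's remark about text objects and is not confirmed elsewhere in the thread.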



Akira Kakuto
2016-02-25 06:34:29 UTC
Permalink
Post by Jonathan Kew
The code for the \XeTeXgenerateactualtext feature (it's an integer
parameter; set it to 1 to get ActualText added to the PDF, for better
copy/paste and search in Acrobat) is now on sourceforge, in an
"actualtext" branch, for anyone who wants to try building and
experimenting with it.
A Windows 32-bit binary for testing, based on Jonathan's 845506,
is available at:
http://members2.jcom.home.ne.jp/wt1357ak/xetex-ac-txt.zip

I'll remove the file in due time.

Best,
Akira



ShreeDevi Kumar
2016-02-27 10:18:55 UTC
Permalink
Hello Akira,

Help needed ...

I downloaded and installed TeX Live 2015 (on a Windows 7 netbook) and then
copied the 6 binary files from your zip file.

However, now when I run xelatex with TeXworks I get this error:

This is XeTeX, Version 3.14159265-2.6-0.99995 (TeX Live 2016/W32TeX/dev)
(preloaded format=xelatex)

restricted \write18 enabled.

---! c:/texlive/2015/texmf-var/web2c/xetex/xelatex.fmt doesn't match
xetex.pool

(Fatal format file error; I'm stymied)


When I try to run the package manager under TeX Live Manager, nothing
happens - I do not get the package manager window to update.


Any suggestions on what I need to do to fix this?


Thanks!

ShreeDevi
Zdenek Wagner
2016-02-27 10:25:02 UTC
Permalink
Post by ShreeDevi Kumar
Hello Akira,
Help needed ...
I downloaded and installed texlive 2015 (on a windows 7 netbook) and then
copied the 6 binary files from your zip file.
However, now when I run xelatex with texworks I am getting the error
This is XeTeX, Version 3.14159265-2.6-0.99995 (TeX Live 2016/W32TeX/dev)
(preloaded format=xelatex)
restricted \write18 enabled.
---! c:/texlive/2015/texmf-var/web2c/xetex/xelatex.fmt doesn't match
xetex.pool
(Fatal format file error; I'm stymied)
fmtutil-sys --byengine xetex

(You have to regenerate the formats for XeTeX)
Post by ShreeDevi Kumar
When I try to run the package manager under Tex Live Manager, nothing
happens - I do not get the package manager window to update.
Any suggestions on what I need to do to fix this.
Thanks!
ShreeDevi
Zdeněk Wagner
http://ttsm.icpf.cas.cz/team/wagner.shtml
http://icebearsoft.euweb.cz
ShreeDevi Kumar
2016-02-27 17:43:30 UTC
Permalink
Thanks.
I reinstalled TeX Live 2015, and now the TL package manager GUI is
working and I was able to regenerate the formats.

ShreeDevi
Akira Kakuto
2016-02-29 12:14:07 UTC
Permalink
Post by Akira Kakuto
SumatraPDF on windows 10 was similar to Foxit Reader - using the glyph stream
instead of the actualtext - in case of devanagari text.
Thanks a lot for correcting my mistake.
Using Heiko's accsupp package, I confirmed that
SumatraPDF on Windows does not use the ActualText for
"select, copy&paste", but uses the glyph stream.
Thank you again ShreeDevi Kumar.

Best,
Akira
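
A check of this kind might look like the following sketch. It is not Akira's actual test file; it simply uses the accsupp package to attach an ActualText string that deliberately differs from the visible glyphs, so that the pasted result shows which source a viewer reads. The dvipdfm driver option is given explicitly on the assumption that the file is compiled with xelatex.

```latex
% Minimal sketch, not Akira's test: does a viewer copy ActualText or glyphs?
% Compile with xelatex; the driver option is named explicitly for xdvipdfmx.
\documentclass{article}
\usepackage[dvipdfm]{accsupp}
\begin{document}
% The page shows "visible", but the attached ActualText says "REPLACED".
\BeginAccSupp{ActualText=REPLACED}visible\EndAccSupp{}
\end{document}
```

Copying the word from the PDF and pasting it elsewhere should give REPLACED in a viewer that honours ActualText, and visible in one that works from the glyph stream.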



ShreeDevi Kumar
2016-02-29 16:23:45 UTC
Permalink
Hello Akira,

I was only reporting the result of my test with SumatraPDF on the PDF
with Devanagari text.

Jonathan had mentioned in an earlier message that Foxit etc. were using the
glyph stream instead of ActualText. Since SumatraPDF gave a similar result
for Devanagari, I mentioned the same.

Regards,
Shree

- sent from my phone. excuse the brevity.