Discussion:
How to convert .doc to plain text ascii in emacs.
Don Saklad
2004-04-28 17:32:54 UTC
What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...

It is an rmail message distributed from local government about an
upcoming public hearing.
Don Saklad
2004-04-28 18:02:40 UTC
Permalink
What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...

It is an rmail message distributed from local government about an
upcoming public hearing. For example, here are some parts of a
specimen message...




This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

------_=_NextPart_000_01C42C97.793C3810
Content-Type: text/plain;
charset="iso-8859-1"


Education Hearings
Please note some of the Education hearing will be held within the Ways and
Means Budget Hearings
<<HN-CNS Edu Dockets #0378 May 10, 2004 discuss school dept district wide
student success plan.doc>> <<HN-CNS Edu Dockets #0251 May 10, 2004 hearing
regarding MCAS.doc>> <<HN-CNS Edu Dockets #0375, 0374, 0376 May 6, 2004
physcial edu prog, nutrition curriculum and staff training, after school
athletic programs.doc>> <<HN-CNS Edu Dockets #0266 and 0369 May 4, 2004
mildred cntr unified facilities plan.doc>> <<HN-CNS Edu Dockets #0489 May
6, 2004 discuss student civic life.doc>> <<HN-CNS Arts, Film, Humanities
and Tourism Docket #0486 May 4, 2004 on the Strand Theater.doc>>

------_=_NextPart_000_01C42C97.793C3810
Content-Type: application/msword;
name="HN-CNS Edu Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="HN-CNS Edu Dockets #0378 May 10, 2004 discuss school dept district wide student success plan.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////




------_=_NextPart_000_01C42C97.793C3810
Content-Type: application/msword;
name="HN-CNS Edu Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename="HN-CNS Edu Dockets #0251 May 10, 2004 hearing regarding MCAS.doc"

0M8R4KGxGuEAAAAAAAAAAAAAAAAAAAAAPgADAP7/CQAGAAAAAAAAAAAAAAACAAAAiQAAAAAAAAAA
EAAAiwAAAAEAAAD+////AAAAAIcAAACIAAAA////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////
////////////////////////////////////////////////////////////////////////////


------_=_NextPart_000_01C42C97.793C3810--
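Each application/msword part in the specimen is base64 text, and its opening characters (0M8R4KGxGuE...) decode to the eight-byte signature that begins an OLE2 compound file, the container used by Word .doc files. So the attachment has to be turned back into bytes before any converter can read it. The following is a minimal Emacs Lisp sketch of that step, assuming the base64 body has been marked as a region by hand. The names my-ole2-magic, my-base64-looks-like-doc-p and my-save-base64-region are hypothetical and not part of Rmail; base64-decode-string, base64-decode-region and write-region are standard Emacs functions.

```elisp
;; Hypothetical helpers, not part of Rmail.
(defconst my-ole2-magic "\320\317\021\340\241\261\032\341"
  "The eight signature bytes (hex D0 CF 11 E0 A1 B1 1A E1) of an OLE2 file.")

(defun my-base64-looks-like-doc-p (base64-text)
  "Return non-nil if BASE64-TEXT decodes to data starting with the OLE2 signature."
  (let ((bytes (base64-decode-string base64-text)))
    (string-prefix-p my-ole2-magic bytes)))

(defun my-save-base64-region (start end file)
  "Decode the base64 text between START and END and write the bytes to FILE."
  (let ((encoded (buffer-substring-no-properties start end))
        ;; Write the decoded bytes unchanged, with no character encoding applied.
        (coding-system-for-write 'binary))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (insert encoded)
      (base64-decode-region (point-min) (point-max))
      (write-region (point-min) (point-max) file))))
```

With the region set around one attachment body, something like (my-save-base64-region (region-beginning) (region-end) "~/hearing.doc") would leave a file that antiword, catdoc or wv can be run on; the file name here is only an illustration.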
Kin Cho
2004-04-28 18:10:14 UTC
Save the attachment, then google. The answer depends on your
platform.

Also ask the sender to provide a pdf version of the document.

-kin
Jay Belanger
2004-04-28 18:17:24 UTC
Post by Kin Cho
Save the attachment, then google. The answer depends on your
platform.
There's undoc.el (http://www.ccs.neu.edu/home/guttman/undoc.el),
which is written in elisp.
Post by Kin Cho
Also ask the sender to provide a pdf version of the document.
The better solution, of course.

Jay
John Russell
2004-04-29 13:47:48 UTC
Jay Belanger <***@truman.edu> writes:

I find the strings command helps if you just need to know what is
in a doc file, but of course...
Post by Jay Belanger
Post by Kin Cho
Also ask the sender to provide a pdf version of the document.
The better solution, of course.
Jay
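For readers unfamiliar with it, strings simply prints the runs of printable characters it finds in a binary file. A rough Emacs Lisp approximation of that idea is sketched below; my-printable-runs is a hypothetical name, and the sketch only looks for printable ASCII, so text that a .doc file stores as 16-bit characters may be missed.

```elisp
;; Hypothetical helper approximating what the strings command does.
(defun my-printable-runs (bytes &optional min-length)
  "Return the runs of printable ASCII found in the string BYTES.
Only runs of at least MIN-LENGTH characters (default 4) are kept."
  (let ((regexp (format "[ -~]\\{%d,\\}" (or min-length 4)))
        (start 0)
        (runs '()))
    ;; Collect each sufficiently long run, then resume after it.
    (while (string-match regexp bytes start)
      (push (match-string 0 bytes) runs)
      (setq start (match-end 0)))
    (nreverse runs)))
```

Applied to the raw bytes of a document, this yields a list of text fragments in file order, with the same caveat as strings itself: formatting data that happens to be printable comes along too.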
Yoni Rabkin Katzenell
2004-04-28 18:13:18 UTC
Post by Don Saklad
What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...
It is an rmail message distributed from local government about an
upcoming public hearing.
I heard that Antiword [http://www.winfield.demon.nl/] can convert doc
files into plain text and postscript. Note though, that I have never
used the software myself.

It would also be good to inform the body in question that they are making
a grave mistake and effectively endorsing a commercial product by
forcing people to purchase Microsoft Windows in order to take part in
governmental activities.

More on word attachments at:
[http://www.gnu.org/philosophy/no-word-attachments.html]
--
"Cut your own wood and it will warm you twice"
Regards, Yoni Rabkin Katzenell
Thomas Persson
2004-05-01 19:02:44 UTC
Post by Yoni Rabkin Katzenell
Post by Don Saklad
What related emacs commands are there that might convert an rmail
attachment from .doc to plain text ascii ?...
It is an rmail message distributed from local government about an
upcoming public hearing.
I heard that Antiword [http://www.winfield.demon.nl/] can convert doc
files into plain text and postscript. Note though, that I have never
used the software myself.
I use antiword and the following code to integrate it with emacs:

(defun antiword-buffer ()
  "Takes the current buffer as input to the external program antiword.

If the current buffer is a ms-word document its contents are replaced
with the output from antiword and the extension `.doc' is replaced
with `.txt' in the buffer-file-name."
  (let ((txt-buffer-file-name (concat (substring (buffer-file-name) 0 -4)
                                      ".txt")))
    ;; Pipe the whole buffer through antiword and replace it with the output.
    (shell-command-on-region (point-min) (point-max)
                             "cat | antiword -" nil t nil)
    (undo-start)
    ;; If the output is antiword's complaint about non-Word input, undo
    ;; the replacement; otherwise visit the text under a .txt name.
    (if (equal (buffer-string) "- is not a Word Document.\n")
        (or (undo-more 1)
            (message "%s - is not a Word Document." (current-buffer)))
      (set-visited-file-name txt-buffer-file-name)
      (not-modified))))

;; The following expression makes sure that antiword-buffer is run when a
;; file with the .doc extension is opened.
(setq auto-mode-alist
      (append '(("\\.doc\\'" . antiword-buffer))
              auto-mode-alist))
g***@speakeasy.net
2004-05-02 14:44:17 UTC
Thanks very much. Your elisp works great. There's one glitch (which I
realize is from antiword):

The three characters "\342\200\231" should be replaced by the single
apostrophe character ('). To do this by hand, I did

M-x replace-regexp Return C-q 342 Return C-q 200 Return C-q 231 Return
Return ' Return

but this does not find the intended string. The problem seems to be
that C-q 342 is immediately (in the minibuffer) converted into an 'a'
with a grave symbol over it. Putting the point on the backslash (\)
preceding the 342 in the antiword-converted buffer and doing "C-u C-x ="
indeed shows this a-with-grave character to be (0342, 226, 0xe2).

To create a simple test case, do the following:

Open an empty *scratch* buffer. Enter into it: C-q 342 Return C-q 200
Return C-q 231 Return. The first character that appears is the
a-with-grave; the second and third characters appear properly as
\200\231.

It is, I think, the failure of C-q 342 to be represented as \342 which
is the problem. What is the solution?


tia,
ken



[....]
R***@core.com
2004-05-02 19:04:12 UTC
Post by g***@speakeasy.net
Thanks very much. Your elisp works great. There's one glitch (which I
realize is from antiword):
The three characters "\342\200\231" should be replaced by the single
apostrophe character ('). To do this by hand, I did
M-x replace-regexp Return C-q 342 Return C-q 200 Return C-q 231 Return
Return ' Return
but this does not find the intended string. The problem seems to be
that C-q 342 is immediately (in the minibuffer) converted into an 'a'
with a grave symbol over it. Putting the point on the backslash (\)
preceding the 342 in the antiword-converted buffer and doing "C-u C-x ="
indeed shows this a-with-grave character to be (0342, 226, 0xe2).
Open an empty *scratch* buffer. Enter into it: C-q 342 Return C-q 200
Return C-q 231 Return. The first character that appears is the
a-with-grave; the second and third characters appear properly as
\200\231.
It is, I think, the failure of C-q 342 to be represented as \342 which
is the problem. What is the solution?
tia,
ken
Have you tried just copying and pasting the character into the minibuffer
when doing the replace-regexp?

--Rod

__________

Author of "Linux for Non-Geeks--Clear-eyed Answers for Practical Consumers"
and "Boring Stories from Uncle Rod." Both are available at
http://www.rodwriterpublishing.com/index.html

To reply by e-mail, take the extra "o" out of the name.
Thomas Persson
2004-05-02 19:26:45 UTC
Post by g***@speakeasy.net
Thanks very much. Your elisp works great. There's one glitch (which I
realize is from antiword):
The three characters "\342\200\231" should be replaced by the single
apostrophe character (').
The fact that antiword and my code leave you with a buffer containing
numerical codes instead of the characters themselves is your first
problem. This doesn't happen for me at all. It's either a problem with
antiword or a problem with how emacs displays characters. Try running
antiword from the command line to figure out which.
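One thing that helps in telling the two apart: the octal bytes \342 \200 \231 (hex E2 80 99) are the UTF-8 encoding of U+2019, the right single quotation mark that Word typically substitutes for a typed apostrophe. Read as Latin-1 instead, the first byte is an accented a (strictly, a with circumflex) and the other two are control codes, which matches the display reported in this thread. A small sketch of checking this from Emacs Lisp with the built-in decode-coding-string follows; the function name my-decode-utf8-bytes is made up for illustration.

```elisp
;; Hypothetical wrapper around the built-in decode-coding-string.
(defun my-decode-utf8-bytes (bytes)
  "Decode the unibyte string BYTES as UTF-8 and return the resulting text."
  (decode-coding-string bytes 'utf-8))

;; The three bytes from the thread decode to a single character, U+2019.
(my-decode-utf8-bytes "\342\200\231")
```

If that is what is happening, the output is probably fine and the buffer is simply being decoded with the wrong coding system.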
Post by g***@speakeasy.net
To do this by hand, I did M-x replace-regexp Return C-q 342 Return
C-q 200 Return C-q 231 Return Return ' Return
but this does not find the intended string. The problem seems to be
that C-q 342 is immediately (in the minibuffer) converted into an 'a'
with a grave symbol over it. Putting the point on the backslash (\)
preceding the 342 in the antiword-converted buffer and doing "C-u C-x ="
indeed shows this a-with-grave character to be (0342, 226, 0xe2).
Open an empty *scratch* buffer. Enter into it: C-q 342 Return C-q 200
Return C-q 231 Return. The first character that appears is the
a-with-grave; the second and third characters appear properly as
\200\231.
It is, I think, the failure of C-q 342 to be represented as \342 which
is the problem. What is the solution?
The fact that you have a problem with replacing the numerical
character codes with the characters themselves is, however, definitely
an Emacs-related problem. As far as I can tell it would work to add the
replace-regexp business to the end of the antiword-buffer function
like this:


(defun antiword-buffer ()
  "Takes the current buffer as input to the external program antiword.

If the current buffer is a ms-word document its contents are replaced
with the output from antiword and the extension `.doc' is replaced
with `.txt' in the buffer-file-name."
  (let ((txt-buffer-file-name (concat (substring (buffer-file-name) 0 -4)
                                      ".txt")))
    ;; Pipe the whole buffer through antiword and replace it with the output.
    (shell-command-on-region (point-min) (point-max)
                             "cat | antiword -" nil t nil)
    (undo-start)
    ;; If the output is antiword's complaint about non-Word input, undo
    ;; the replacement; otherwise visit the text under a .txt name.
    (if (equal (buffer-string) "- is not a Word Document.\n")
        (or (undo-more 1)
            (message "%s - is not a Word Document." (current-buffer)))
      (set-visited-file-name txt-buffer-file-name)
      (not-modified)
      ;; Attempt to turn the three-character sequence into a plain
      ;; apostrophe, searching from the top of the buffer.
      (goto-char (point-min))
      (replace-regexp "\342\200\231" "'"))))

;; The following expression makes sure that antiword-buffer is run when a
;; file with the .doc extension is opened.
(setq auto-mode-alist
      (append '(("\\.doc\\'" . antiword-buffer))
              auto-mode-alist))


If that doesn't work then perhaps "wvWare" or "undoc.el", as previous
posters have suggested, might be better solutions for you.
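If the stray codes are UTF-8 bytes that Emacs read under another coding system, an alternative to patching them afterwards is to tell Emacs how to decode antiword's output in the first place. The sketch below assumes an antiword executable on the PATH that is emitting UTF-8 (which the \342\200\231 sequence suggests but does not prove). The names my-antiword-file-to-string and my-asciify-apostrophes are hypothetical; call-process, coding-system-for-read and replace-regexp-in-string are standard Emacs facilities.

```elisp
;; Hypothetical helpers; assumes antiword is installed and writes UTF-8.
(defun my-antiword-file-to-string (doc-file)
  "Run antiword on DOC-FILE and return its output decoded as UTF-8."
  (let ((coding-system-for-read 'utf-8))
    (with-temp-buffer
      ;; A DESTINATION of t sends the program's output to the current
      ;; (temporary) buffer, decoded with `coding-system-for-read'.
      (call-process "antiword" nil t nil (expand-file-name doc-file))
      (buffer-string))))

(defun my-asciify-apostrophes (text)
  "Return TEXT with each U+2019 replaced by an ASCII apostrophe."
  (replace-regexp-in-string "\u2019" "'" text t t))
```

A call such as (my-asciify-apostrophes (my-antiword-file-to-string "~/hearing.doc")), with an illustrative file name, would then give plain ASCII apostrophes. Other typographic characters, such as curly double quotes, would need similar treatment.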

Tim X
2004-05-02 08:57:14 UTC
Don> What related emacs commands are there that might convert an
Don> rmail attachment from .doc to plain text ascii ?...

Don> It is an rmail message distributed from local government about
Don> an upcoming public hearing.

There are two solutions I've used for this. The first is a set of
utilities called wvWare - I think they are related to abiword. At any
rate, if you're using Debian, just install wv.

The second product I've used is one called catdoc. It's not quite as
powerful, but works reasonably well.

If you're using VM as your mail reader, it's trivial to configure it to
run either the wv utility or catdoc on the attachment and have it
display in the buffer as text. With wv, I think you also have the
option to have it rendered as HTML.
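For VM, the hook for this is, as far as I recall from the VM manual, the variable vm-mime-type-converter-alist. Its entries name a source MIME type, a target type, and a shell command that receives the object on standard input and writes the converted form to standard output. A hedged sketch, assuming catdoc is installed and reads standard input when given no file name; check the variable's documentation in your VM version before relying on it.

```elisp
;; Assumed VM configuration; verify with C-h v vm-mime-type-converter-alist.
;; Note that setq replaces any converter entries already in the list.
(setq vm-mime-type-converter-alist
      '(("application/msword" "text/plain" "catdoc")))
```

The same shape should work with another converter in place of catdoc, provided its command line reads the document from standard input.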

As a last resort, you could use "strings" on the document to get the
content, but you will probably have a fair amount of crap mixed in
with it.

Note that the only time I've found the wv utilities have failed is
when I've received attachments with the msword mime type, but which are
actually M$ bloody RTF format. I haven't worked out a reliable way to
translate M$ RTF (which is not the rich text format we all knew a
decade ago!).

I would also contact the authority who sends word documents and request
they use a less proprietary format - even PDF is better!

tim
--
Tim Cross
The e-mail address on this message is FALSE (obviously!). My real e-mail is
to a company in Australia called rapttech and my login is tcross - if you
really need to send mail, you should be able to work it out!