Discussion:
CommandLineToArgvA?
Vincent Fatica
2009-06-04 01:10:58 UTC
Is there a function that will parse a multibyte string, producing a count and
distinct multibyte args (similar to CommandLineToArgvW)? The string I want to
parse is not a command line but I want to treat it exactly like a command line
and wind up with multibyte args. Thanks.
--
- Vince
Tim Roberts
2009-06-05 03:18:18 UTC
Post by Vincent Fatica
Is there a function that will parse a multibyte string, producing a count and
distinct multibyte args (similar to CommandLineToArgvW)? The string I want to
parse is not a command line but I want to treat it exactly like a command line
and wind up with multibyte args. Thanks.
How much sophistication do you need? CStringW::Tokenize can split a string
up into tokens based on separator characters. It doesn't handle quoted
parameters, however.

Or, you could just convert to Unicode and call CommandLineToArgvW...
--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
Vincent Fatica
2009-06-05 13:48:07 UTC
On Thu, 04 Jun 2009 20:18:18 -0700, Tim Roberts <***@probo.com> wrote:

|How much sophistication do you need? CStringW::Tokenize can split a string
|up into tokens based on separator characters. It doesn't handle quoted
|parameters, however.

I want the handling of quoted args.

|Or, you could just convert to Unicode and call CommandLineToArgvW...

... and convert the individual args back to multibyte. I did exactly that,
imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
a single LocalFree() will free the pointers and the strings. My allocation
assumes that the individual MBCS args will have the same lengths as their wide
counterparts (because I allocated before converting the strings back to
multibyte). Is there a way to *guarantee* that my assumption is valid?
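A sketch of why this assumption cannot be guaranteed in general: in a double-byte code page one character takes two bytes in the MBCS string but typically one WCHAR in the wide string, so the byte length can exceed the wide length. The code below is portable C rather than Windows code; `cp932_is_lead` and `cp932_char_count` are hypothetical helpers, and the lead-byte ranges are the ones usually cited for code page 932 (Shift-JIS).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: nonzero if b is a lead byte in code page 932.
   The ranges 0x81-0x9F and 0xE0-0xFC are an assumption of this sketch. */
int cp932_is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Count characters (not bytes) in a NUL-terminated CP932 string.
   A lead byte plus the byte after it counts as one character. */
size_t cp932_char_count(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s) {
        if (cp932_is_lead((unsigned char)*s) && s[1] != '\0')
            s += 2;   /* double-byte character */
        else
            s += 1;   /* single-byte character, or a dangling lead byte */
        n++;
    }
    return n;
}
```

For the three bytes 0x83 0x5C 'a' this counts 2 characters, so a buffer sized from the character count plus a terminator would be one byte short. On Windows, calling WideCharToMultiByte with cbMultiByte set to 0 returns the required size in bytes, which avoids guessing.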
--
- Vince
Vincent Fatica
2009-06-05 15:34:14 UTC
On 5 Jun 2009 09:48:07 -0400, Vincent Fatica <***@blackholespam.net> wrote:

||Or, you could just convert to Unicode and call CommandLineToArgvW...
|
| ... and convert the individual args back to multibyte. I did exactly that,
|imitating the allocation scheme (with LocalAlloc) of CommandLineToArgvW so that
|a single LocalFree() will free the pointers and the strings. My allocation
|assumes that the individual MBCS args will have the same lengths as their wide
|counterparts (because I allocated before converting the strings back to
|multibyte). Is there a way to *guarantee* that my assumption is valid?

Never mind. Instead of MultiByteToWideChar and WideCharToMultiByte, I just did
the conversions by assignment. That will ensure that the strings are the same
length and that the characters in the parsed string are exactly the same as in
the original string (and it doesn't mess with '"' and '\\', which are important
to CommandLineToArgvW).
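A minimal sketch of what "conversion by assignment" does to bytes above 0x7F, assuming plain char is signed (the MSVC default). Here `wide16` is a stand-in for WCHAR and `signed char` a stand-in for CHAR; all function names are hypothetical. Assigning a negative char to a 16-bit unit sign-extends it, so the intermediate value is 0xFFC1 rather than 0x00C1, yet the truncating cast on the way back still restores the original byte.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for WCHAR (an assumption of this sketch). */
typedef uint16_t wide16;

/* Widen by plain assignment: negative bytes are sign-extended. */
wide16 widen_by_assignment(signed char c)
{
    return (wide16)c;
}

/* Widen through unsigned char: bytes 0x80..0xFF become 0x0080..0x00FF. */
wide16 widen_zero_extended(signed char c)
{
    return (wide16)(unsigned char)c;
}

/* Narrow by truncation, as a (CHAR) cast does: keep the low 8 bits. */
unsigned char narrow_by_truncation(wide16 w)
{
    return (unsigned char)(w & 0xFF);
}

/* Self-check: both widenings survive the truncating cast back. */
void check_round_trip(signed char c)
{
    assert(narrow_by_truncation(widen_by_assignment(c)) == (unsigned char)c);
    assert(narrow_by_truncation(widen_zero_extended(c)) == (unsigned char)c);
}
```

The round trip is byte-exact either way; what differs is the intermediate wide value, which matters only if something inspects the wide string in between.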
--
- Vince
Tim Roberts
2009-06-07 02:13:11 UTC
Post by Vincent Fatica
|How much sophistication do you need? CStringW::Tokenize can split a string
|up into tokens based on separator characters. It doesn't handle quoted
|parameters, however.
I want the handling of quoted args.
|Or, you could just convert to Unicode and call CommandLineToArgvW...
... and convert the individual args back to multibyte.
Well, this is getting a bit off track of your original query, but you might
consider whether this is the time to convert your whole app to Unicode.
There are distinct advantages to doing so, including a slight performance
boost.
--
Tim Roberts, ***@probo.com
Providenza & Boekelheide, Inc.
Vincent Fatica
2009-06-07 04:01:05 UTC
On Sat, 06 Jun 2009 19:13:11 -0700, Tim Roberts <***@probo.com> wrote:

|Well, this is getting a bit off track of your original query, but you might
|consider whether this is the time to convert your whole app to Unicode.
|There are distinct advantages to doing so, including a slight performance
|boost.

I usually write everything in Unicode. But the project in question is a plugin
DLL for a MBCS app. At plugin init time, the host app passes a single (MBCS)
string, a user parameter. I wanted to parse it like a command line to give the
user greater flexibility (namely quoted strings being a single arg) and to allow
me to use a normal process_argv routine. I came up with this, which works well
(though it lacks error checking). It works just like CommandLineToArgvW.

CHAR** WINAPI MBStringToMBArgv(LPSTR str, INT *pargc)
{
    // alloc memory for a wide version of str
    LPWSTR wstr = (LPWSTR) LocalAlloc(LMEM_FIXED,
                                      (lstrlenA(str) + 1) * sizeof(WCHAR));

    // "copy" str to wstr
    WCHAR *wp = wstr;
    while ( *wp++ = *str++ );

    // parse wstr
    WCHAR **wargv = CommandLineToArgvW(wstr, pargc);

    // cleanup
    LocalFree(wstr);

    // determine memory needed for argv
    size_t needed = *pargc * (sizeof(CHAR*) + 1);   // ptrs and NULs
    for ( INT i=0; i<*pargc; i++ )                  // add arg lengths
        needed += lstrlenW(wargv[i]);

    // allocate memory for ptrs and args
    LPBYTE argv = (LPBYTE) LocalAlloc(LMEM_FIXED, needed);

    // fill the pointers and strings
    CHAR **ptrs = (CHAR**) argv;
    CHAR *parg = (CHAR*) ((CHAR**) argv + *pargc);
    for ( INT i=0; i<*pargc; i++ )
    {
        ptrs[i] = parg;
        wp = wargv[i];
        while ( *parg++ = (CHAR) *wp++ );
    }

    // cleanup
    LocalFree(wargv);

    // when done with it use LocalFree() on the returned pointer
    return (CHAR**) argv;
}
--
- Vince
r***@gmail.com
2009-06-07 04:59:09 UTC
Post by Vincent Fatica
CHAR** WINAPI MBStringToMBArgv(LPSTR str, INT *pargc)
{
        // alloc memory for a wide version of str
        LPWSTR wstr = (LPWSTR) LocalAlloc(LMEM_FIXED,
                                        (lstrlenA(str) + 1) * sizeof(WCHAR));
        // "copy" str to wstr
        WCHAR *wp = wstr;
        while ( *wp++ = *str++ );
You should use a proper multibyte to widecode conversion function so
you don't do the wrong thing if someone sends you a multibyte string
with a multibyte character.
Post by Vincent Fatica
        // parse wstr
        WCHAR **wargv = CommandLineToArgvW(wstr, pargc);
        // cleanup
        LocalFree(wstr);
        // determine memory needed for argv
        size_t needed = *pargc * (sizeof(CHAR*) + 1);   // ptrs and NULs
        for ( INT i=0; i<*pargc; i++ )                                       // add arg lengths
                needed += lstrlenW(wargv[i]);
        // allocate memory for ptrs and args
        LPBYTE argv = (LPBYTE) LocalAlloc(LMEM_FIXED, needed);
        // fill the pointers and strings
        CHAR **ptrs = (CHAR**) argv;
        CHAR *parg = (CHAR*) ((CHAR**) argv + *pargc);
        for ( INT i=0; i<*pargc; i++ )
        {
                ptrs[i] = parg;
                wp = wargv[i];
                while ( *parg++ = (CHAR) *wp++ );
        }
        // cleanup
        LocalFree(wargv);
        // when done with it use LocalFree() on the returned pointer
        return (CHAR**) argv;
}
--
 - Vince
Vincent Fatica
2009-06-07 07:27:29 UTC
On Sat, 6 Jun 2009 21:59:09 -0700 (PDT), ***@gmail.com wrote:

|>         // "copy" str to wstr
|>         WCHAR *wp = wstr;
|>         while ( *wp++ = *str++ );
|
|You should use a proper multibyte to widecode conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.

I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
and wind up where you started.
--
- Vince
Scot T Brennecke
2009-06-07 09:01:27 UTC
|> // "copy" str to wstr
|> WCHAR *wp = wstr;
|> while ( *wp++ = *str++ );
|
|You should use a proper multibyte to widecode conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.
I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
and wind up where you started.
If not, it would be a reportable bug. Do you have any evidence to
suggest that wouldn't work? If so, let's report it.
Vincent Fatica
2009-06-07 13:59:46 UTC
On Sun, 07 Jun 2009 04:01:27 -0500, Scot T Brennecke <***@Spamhater.MVPs.org>
wrote:

|> I'm not very confident that you can MultiByteToWideChar then WideCharToMultiByte
|> and wind up where you started.
|
|If not, it would be a reportable bug. Do you have any evidence to
|suggest that wouldn't work? If so, let's report it.

The file 00ff.bin contains each byte, 0~255. CP 875 is Greek. This code gives
the results below it.

BYTE before[256], after[256];
for ( INT i=0; i<256; i++ )
    before[i] = (BYTE) i;
WCHAR wbuf[256];
// convert all 256 byte values to wide and back using code page 875
MultiByteToWideChar(875, 0, (CHAR*) before, 256, wbuf, 256);
WideCharToMultiByte(875, 0, wbuf, 256, (CHAR*) after, 256, NULL, NULL);
// print every byte that did not survive the round trip
for ( INT i=0; i<256; i++ )
{
    if ( before[i] != after[i] )
        printf("%u %u\n", before[i], after[i]);
}

220 63
225 63
236 63
237 63
252 63
253 63
--
- Vince
Vincent Fatica
2009-06-07 15:10:58 UTC
On Sun, 7 Jun 2009 10:41:34 -0400, "Igor Tandetnik" <***@mvps.org> wrote:

|Those codepoints are not valid in CP875. Of course you can only expect
|roundtrip if the original string is a valid MBCS string for its codepage
|to begin with. In case you are wondering, 63 is the code for question
|mark '?'.

Oddly, if I use MB_ERR_INVALID_CHARS in MultiByteToWideChar it still succeeds.
Going back, with WideCharToMultiByte and WC_ERR_INVALID_CHARS, it fails.

I don't want to be the policeman. Do you think my method of simple assignment
to convert CHAR <-> WCHAR will foul up CommandLineToArgvW? If the user provides
garbage, I figure he'll get garbage back.

|But I'm not sure why you _need_ a roundtrip. You say your plugin is
|Unicode, except for this one parameter string. So you only need to
|convert it one way, so that everything is now Unicode, right?

I said my plugin was MBCS (as is the hosting app). It uses no C library
functions. I could convert it to Unicode but I'd find myself calling "A"
functions most of the time anyway.
--
- Vince
Igor Tandetnik
2009-06-07 15:44:47 UTC
Post by Vincent Fatica
I don't want to be the policeman. Do you think my method of simple
assignment to convert CHAR <-> WCHAR will foul up CommandLineToArgvW?
No. But if the current system codepage is in fact CP1253 (Windows
codepage for Greek), and the caller did want to pass some Greek
characters to you, you will silently convert them to accented latin
characters that just happen to have the same codes in Latin-1 aka
ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
U+00FF correspond to, for historical reasons).

For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
ACUTE.

In other words, your technique only works correctly if you are sure the
incoming string consists entirely of plain vanilla ASCII-7 characters
(codepoints 0 through 127).
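A minimal sketch of a guard for that condition, in portable C. The name `is_ascii7` is hypothetical; it only reports whether every byte is below 0x80, which is the case in which byte-by-byte widening and a real code page conversion agree.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if every byte before the terminating NUL is 0x7F or below,
   0 otherwise. An empty string counts as ASCII. */
int is_ascii7(const char *s)
{
    assert(s != NULL);
    for (; *s; s++) {
        if ((unsigned char)*s > 0x7F)
            return 0;
    }
    return 1;
}
```

A caller could use such a check to decide whether the simple copy is safe or whether the string has to go through a code-page-aware path instead.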
--
With best wishes,
Igor Tandetnik

With sufficient thrust, pigs fly just fine. However, this is not
necessarily a good idea. It is hard to be sure where they are going to
land, and it could be dangerous sitting under them as they fly
overhead. -- RFC 1925
Vincent Fatica
2009-06-07 16:15:51 UTC
On Sun, 7 Jun 2009 11:44:47 -0400, "Igor Tandetnik" <***@mvps.org> wrote:

|No. But if the current system codepage is in fact CP1253 (Windows
|codepage for Greek), and the caller did want to pass some Greek
|characters to you, you will silently convert them to accented latin
|characters that just happen to have the same codes in Latin-1 aka
|ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
|U+00FF correspond to, for historical reasons).
|
|For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
|CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A WITH
|ACUTE.

What's the problem? When I convert each Unicode argv back to MBCS with

while ( *p++ = (CHAR) *wp++ );

won't it go back to 193 (and again be interpreted as GREEK CAPITAL LETTER
ALPHA)? I don't think CommandLineToArgvW cares whether it's GREEK CAPITAL
LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm assuming
CommandLineToArgvW only **interprets** whitespace, backslashes, and
double-quotes.
--
- Vince
Igor Tandetnik
2009-06-07 16:49:09 UTC
Post by Vincent Fatica
On Sun, 7 Jun 2009 11:44:47 -0400, "Igor Tandetnik"
Post by Igor Tandetnik
No. But if the current system codepage is in fact CP1253 (Windows
codepage for Greek), and the caller did want to pass some Greek
characters to you, you will silently convert them to accented latin
characters that just happen to have the same codes in Latin-1 aka
ISO-8859-1 codepage (which is what Unicode codepoints U+0000 through
U+00FF correspond to, for historical reasons).
For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A
WITH ACUTE.
What's the problem? When I convert each Unicode argv back to MBCS with
while ( *p++ = (CHAR) *wp++ );
won't it go back to 193 (and again be interpreted as GREEK CAPITAL
LETTER ALPHA)? I don't think CommandLineToArgvW cares whether it's
GREEK CAPITAL LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm
assuming CommandLineToArgvW only **interprets** whitespace,
backslashes, and double-quotes.
Ah, I didn't realize you were going to Unicode and back. Anyway, you'd
still have problems with true double-byte encodings, like Chinese BIG-5
or Japanese Shift-JIS. In these encodings, some characters are
represented by two bytes, called lead byte and trailing byte. Lead byte
always has high bit set, but trailing byte could have any value at all,
including values that just happen to be the same as ASCII codes for
space, backslash or double quote.

Your naive algorithm will convert such double-byte character to two
independent Unicode codepoints. The codepoint corresponding to the
trailing byte could then be interpreted by CommandLineToArgvW as a
separator. As a result, a) some parameter will be broken up in the
middle, and b) when your algorithm converts back from Unicode to MBCS,
you'll end up with a lead byte not followed by a trailing byte (or
followed by an unrelated ASCII character that will be misinterpreted as
a trailing byte).
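The trail-byte collision can be sketched in portable C. The example assumes code page 932 (Shift-JIS), where the katakana letter SO is commonly cited as being encoded 0x83 0x5C, so its trail byte equals the ASCII backslash. In that code page, as usually documented, trail bytes fall in 0x40-0x7E and 0x80-0xFC, so of the three characters mentioned it is the backslash that can actually occur as a trail byte. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed lead-byte ranges for Shift-JIS (code page 932). */
int sjis_lead_byte(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Byte-wise scan: counts every 0x5C, including trail bytes. */
size_t count_backslashes_bytewise(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    for (; *s; s++) {
        if (*s == '\\')
            n++;
    }
    return n;
}

/* Lead-byte-aware scan: skips the trail byte of each double-byte
   character, so only standalone backslashes are counted. */
size_t count_backslashes_sjis(const char *s)
{
    size_t n = 0;

    assert(s != NULL);
    while (*s) {
        if (sjis_lead_byte((unsigned char)*s) && s[1] != '\0') {
            s += 2;
            continue;
        }
        if (*s == '\\')
            n++;
        s++;
    }
    return n;
}
```

For the bytes 0x83 0x5C followed by one real backslash, the byte-wise scan reports two backslashes and the lead-byte-aware scan reports one. A parser that widens byte by byte sees the input the way the first function does.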
--
With best wishes,
Igor Tandetnik
Vincent Fatica
2009-06-07 17:24:19 UTC
On Sun, 7 Jun 2009 12:49:09 -0400, "Igor Tandetnik" <***@mvps.org> wrote:

|Vincent Fatica wrote:
|> On Sun, 7 Jun 2009 11:44:47 -0400, "Igor Tandetnik"
|> <***@mvps.org> wrote:
|>
|>> For example, GREEK CAPITAL LETTER ALPHA is code 193 (hex 0xC1) in
|>> CP1253. But you will interpret it as U+00C1, LATIN CAPITAL LETTER A
|>> WITH ACUTE.
|>
|> What's the problem? When I convert each Unicode argv back to MBCS
|> with
|>
|> while ( *p++ = (CHAR) *wp++ );
|>
|> won't it go back to 193 (and again be interpreted as GREEK CAPITAL
|> LETTER ALPHA)? I don't think CommandLineToArgvW cares whether it's
|> GREEK CAPITAL LETTER ALPHA or LATIN CAPITAL LETTER A WITH ACUTE. I'm
|> assuming CommandLineToArgvW only **interprets** whitespace,
|> backslashes, and double-quotes.
|
|Ah, I didn't realize you were going to Unicode and back. Anyway, you'd
|still have problems with true double-byte encodings, like Chinese BIG-5
|or Japanese Shift-JIS. In these encodings, some characters are
|represented by two bytes, called lead byte and trailing byte. Lead byte
|always has high bit set, but trailing byte could have any value at all,
|including values that just happen to be the same as ASCII codes for
|space, backslash or double quote.
|
|Your naive algorithm will convert such double-byte character to two
|independent Unicode codepoints. The codepoint corresponding to the
|trailing byte could then be interpreted by CommandLineToArgvW as a
|separator. As a result, a) some parameter will be broken up in the
|middle, and b) when your algorithm converts back from Unicode to MBCS,
|you'll end up with a lead byte not followed by a trailing byte (or
|followed by an unrelated ASCII character that will be misinterpreted as
|a trailing byte).

Yes, I see. STDARGV.C deals with this (if (_ismbblead(c)) ...). Do you think
that I could (possibly with some effort) include STDARGV.C in my project and use
its parse_cmdline()?
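As an alternative to borrowing CRT source, a lead-byte-aware splitter is small enough to sketch directly. This is not the CRT's parse_cmdline and not CommandLineToArgvW. It is a simplified illustration of the commonly described rules: spaces and tabs separate arguments, double quotes group, and a run of backslashes is special only when followed by a double quote. It does not reproduce any special handling of the first argument or of doubled quotes inside a quoted string. `split_args`, `lead_cp932` and `lead_none` are hypothetical names, and the CP932 ranges are an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Assumed lead-byte test for code page 932; another predicate (or the
   CRT's _ismbblead on Windows) could be passed for other code pages. */
int lead_cp932(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

/* Predicate for single-byte code pages: nothing is a lead byte. */
int lead_none(unsigned char b)
{
    (void)b;
    return 0;
}

/* Hypothetical splitter: parses s in place, stores up to max_args
   pointers in argv, and returns the number of arguments found. */
int split_args(char *s, char **argv, int max_args,
               int (*is_lead)(unsigned char))
{
    char *in = s, *out = s;   /* out never gets ahead of in */
    int argc = 0;

    assert(s != NULL && argv != NULL && is_lead != NULL);
    for (;;) {
        int in_quotes = 0;

        in += strspn(in, " \t");                 /* skip separators */
        if (*in == '\0' || argc == max_args)
            break;
        argv[argc++] = out;

        while (*in) {
            unsigned char c = (unsigned char)*in;

            if (is_lead(c) && in[1] != '\0') {   /* copy both bytes as-is */
                *out++ = *in++;
                *out++ = *in++;
            } else if (c == '\\') {
                size_t n = strspn(in, "\\"), k;
                int before_quote = (in[n] == '"');
                size_t keep = before_quote ? n / 2 : n;

                for (k = 0; k < keep; k++)
                    *out++ = '\\';
                in += n;
                if (before_quote && (n % 2)) {   /* odd run: literal quote */
                    *out++ = '"';
                    in++;
                }
            } else if (c == '"') {               /* toggle quoting, drop it */
                in_quotes = !in_quotes;
                in++;
            } else if (!in_quotes && (c == ' ' || c == '\t')) {
                break;                           /* end of this argument */
            } else {
                *out++ = *in++;
            }
        }
        if (*in == '\0') {
            *out = '\0';
            break;
        }
        in++;                  /* consume the separator */
        *out++ = '\0';         /* terminate this argument */
    }
    return argc;
}
```

Because the output is never longer than the input, the arguments can be written back into the caller's buffer, and no wide-character round trip is needed. With `lead_cp932`, the bytes 0x83 0x5C followed by `"a b"` yield one argument; with `lead_none` the same bytes split into two, which is the failure described above.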
--
- Vince
Igor Tandetnik
2009-06-07 14:41:34 UTC
Post by Vincent Fatica
On Sun, 07 Jun 2009 04:01:27 -0500, Scot T Brennecke
Post by Scot T Brennecke
Post by Vincent Fatica
I'm not very confident that you can MultiByteToWideChar then
WideCharToMultiByte and wind up where you started.
If not, it would be a reportable bug. Do you have any evidence to
suggest that wouldn't work? If so, let's report it.
The file 00ff.bin contains each byte, 0~255. CP 875 is Greek. This
code gives the results below it.
220 63
225 63
236 63
237 63
252 63
253 63
http://www.ascii.ca/ebc875.htm

Those codepoints are not valid in CP875. Of course you can only expect
roundtrip if the original string is a valid MBCS string for its codepage
to begin with. In case you are wondering, 63 is the code for question
mark '?'.

But I'm not sure why you _need_ a roundtrip. You say your plugin is
Unicode, except for this one parameter string. So you only need to
convert it one way, so that everything is now Unicode, right?
--
With best wishes,
Igor Tandetnik
Vincent Fatica
2009-06-07 14:24:52 UTC
On Sat, 6 Jun 2009 21:59:09 -0700 (PDT), ***@gmail.com wrote:

|>         // "copy" str to wstr
|>         WCHAR *wp = wstr;
|>         while ( *wp++ = *str++ );
|
|You should use a proper multibyte to widecode conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.

Do you think CommandLineToArgvW cares about that?
--
- Vince
David Wilkinson
2009-06-07 17:44:10 UTC
Post by Vincent Fatica
|You should use a proper multibyte to widecode conversion function so
|you don't do the wrong thing if someone sends you a multibyte string
|with a multibyte character.
Do you think CommandLineToArgvW cares about that?
If it cares about getting the right answer, I would think it would care about
having the correct input. Only if all the characters are ASCII can you do the
conversion in a simple character-by-character manner.
--
David Wilkinson
Visual C++ MVP
David Wilkinson
2009-06-05 10:15:58 UTC
Post by Vincent Fatica
Is there a function that will parse a multibyte string, producing a count and
distinct multibyte args (similar to CommandLineToArgvW)? The string I want to
parse is not a command line but I want to treat it exactly like a command line
and wind up with multibyte args. Thanks.
In MFC there is the CCommandLineInfo class.
--
David Wilkinson
Visual C++ MVP
u***@gmail.com
2017-02-05 03:32:24 UTC
Post by Vincent Fatica
Is there a function that will parse a multibyte string, producing a count and
distinct multibyte args (similar to CommandLineToArgvW)? The string I want to
parse is not a command line but I want to treat it exactly like a command line
and wind up with multibyte args. Thanks.
--
- Vince
Please see the WINE project

https://www.winehq.org/

It's awesome. It contains source code for an implementation of `CommandLineToArgvW`, which should meet your needs.