[Winpcap-users] pcap_open_offline and unicode charsets

Guy Harris guy at alum.mit.edu
Sun Nov 8 11:07:06 PST 2009


On Nov 7, 2009, at 5:23 PM, Mark Bednarczyk wrote:

> What support is there, for unicode character based file names in  
> WinPcap to functions such as pcap_open_offline?
>
>  I have users that are trying to open a file with some Chinese  
> characters in its filename. As far as I understand it, fopen under  
> unix (especially under linux) should handle unicode 8-bit with no  
> problems.

I presume by "unicode 8-bit" you mean UTF-8-encoded Unicode.

fopen() on UN*Xes passes the pathname on to open(); that means that it  
should handle any sequence of octets as long as the octet value 0x2f  
is used *only* as a pathname component separator and the octet value  
0x00 is used *only* as a pathname string terminator.

Most local file systems will not attempt to interpret that string,  
except to treat 0x2f (/) as a component separator and 0x00 ('\0') as  
a pathname string terminator.  Whether a particular file name is  
encoded as UTF-8, or ISO 8859/1, ... is another matter; I have the  
impression that various UN*Xes are tending towards UTF-8 as the most  
common encoding, but there are probably still systems using other  
encodings.
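
A minimal sketch of that rule, for illustration only: the helper name is hypothetical and is not part of libpcap or of any UN*X API. It checks whether an octet sequence could be used as a single pathname component, and deliberately pays no attention to the encoding:

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if the len octets at s could serve
 * as ONE pathname component on a typical local UN*X file system, i.e.
 * the sequence is non-empty and contains neither 0x2f ('/') nor 0x00.
 * It ignores encoding entirely, and also ignores length limits and the
 * special names "." and "..".
 */
bool is_plain_component(const unsigned char *s, size_t len)
{
    if (len == 0)
        return false;
    for (size_t i = 0; i < len; i++) {
        if (s[i] == 0x2f || s[i] == 0x00)
            return false;
    }
    return true;
}
```

Under this check a UTF-8 name and an ISO 8859/1 name both pass, since neither encoding uses the octets 0x2f or 0x00 for anything other than '/' and NUL.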

> Linux also handles wider widths but only in a non-intentional way  
> where wider width chars are handled as 8-bit entities (ie. 0x1065 is  
> handled as 2 separate 8-bit chars: 0x65 and 0x10 where order is  
> dependent on processor endianness.)

I would hope it does no such thing, especially with, for example, the  
wide character 0x2f65 (⽥).  If you hand any UN*X API that takes  
pathnames an octet sequence containing the octet 0x2f followed by the  
octet 0x65, I would hope that it would be interpreted as containing  
"/e", and, similarly, if you hand it a string containing the octet 0x65  
followed by the octet 0x2f, I would hope that it would be interpreted  
as containing "e/".
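
To make the 0x2f65 example concrete, here is a small sketch (the function names are hypothetical) that encodes a code point as UTF-8 and looks for the octet 0x2f. The raw 16-bit value contains 0x2f in either byte order, while its UTF-8 encoding does not:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode one Unicode code point as UTF-8. Returns the number of octets
 * written to out (1 to 4), or 0 if cp is a surrogate or out of range.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xc0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3f));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xd800 && cp <= 0xdfff)
            return 0;               /* surrogates are not characters */
        out[0] = (unsigned char)(0xe0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[2] = (unsigned char)(0x80 | (cp & 0x3f));
        return 3;
    }
    if (cp <= 0x10ffff) {
        out[0] = (unsigned char)(0xf0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3f));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3f));
        out[3] = (unsigned char)(0x80 | (cp & 0x3f));
        return 4;
    }
    return 0;
}

/* Returns 1 if any of the n octets at p is 0x2f ('/'), else 0. */
int contains_slash(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (p[i] == 0x2f)
            return 1;
    }
    return 0;
}
```

For 0x2f65 the UTF-8 form is the three octets 0xe2 0xbd 0xa5. Every octet of a multi-octet UTF-8 sequence has its high bit set, so none of them can collide with '/' or NUL.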

> Under MSFC is different and you have to use MS specific wfopen and  
> wopen calls which take unicode (or wide chars).
>
> Does WinPcap provide any support for unicode and call the  
> appropriate "open" function?

No, it just uses fopen(), just as libpcap does on UN*X.

In theory, it could convert from UTF-8 to UTF-16 and call _wfopen(),  
but that could conceivably break existing applications that either  
explicitly or implicitly expect the path argument to  
pcap_open_offline() to work the same as the path argument to fopen().

My inclination would be to, in WinPcap, provide pcap_wopen_offline(),  
or something such as that, taking a UTF-16 pathname as an argument.
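
For illustration, a rough sketch of the "convert from UTF-8 to UTF-16 and call _wfopen()" idea. The names utf8_to_utf16() and fopen_utf8() are hypothetical; this is not WinPcap code, and pcap_wopen_offline() remains only a proposal. On Windows the conversion could instead be done with MultiByteToWideChar(CP_UTF8, ...); a hand-written converter is used here only so that the conversion part also compiles elsewhere:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/*
 * Convert a NUL-terminated UTF-8 string to a malloc'ed, NUL-terminated
 * UTF-16 string. Returns NULL on malformed input (including overlong
 * forms and encoded surrogates) or on allocation failure.
 */
uint16_t *utf8_to_utf16(const char *src)
{
    /* Smallest code point allowed for 0, 1, 2, 3 continuation octets. */
    static const uint32_t min_cp[] = {0, 0x80, 0x800, 0x10000};
    const unsigned char *s = (const unsigned char *)src;
    /* Each UTF-8 octet yields at most one UTF-16 code unit. */
    uint16_t *out = malloc((strlen(src) + 1) * sizeof *out);
    size_t o = 0;

    if (out == NULL)
        return NULL;
    while (*s != 0) {
        uint32_t cp;
        int extra;

        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xe0) == 0xc0) { cp = *s & 0x1f; extra = 1; }
        else if ((*s & 0xf0) == 0xe0) { cp = *s & 0x0f; extra = 2; }
        else if ((*s & 0xf8) == 0xf0) { cp = *s & 0x07; extra = 3; }
        else goto bad;
        s++;
        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xc0) != 0x80)    /* also catches a premature NUL */
                goto bad;
            cp = (cp << 6) | (*s & 0x3f);
        }
        if (cp < min_cp[extra] || cp > 0x10ffff ||
            (cp >= 0xd800 && cp <= 0xdfff))
            goto bad;
        if (cp < 0x10000) {
            out[o++] = (uint16_t)cp;
        } else {                        /* encode as a surrogate pair */
            cp -= 0x10000;
            out[o++] = (uint16_t)(0xd800 | (cp >> 10));
            out[o++] = (uint16_t)(0xdc00 | (cp & 0x3ff));
        }
    }
    out[o] = 0;
    return out;
bad:
    free(out);
    return NULL;
}

#ifdef _WIN32
#include <wchar.h>
/* Windows only, where wchar_t is a 16-bit UTF-16 code unit. */
FILE *fopen_utf8(const char *path, const char *mode)
{
    uint16_t *wpath = utf8_to_utf16(path);
    uint16_t *wmode = utf8_to_utf16(mode);
    FILE *fp = NULL;

    if (wpath != NULL && wmode != NULL)
        fp = _wfopen((const wchar_t *)wpath, (const wchar_t *)wmode);
    free(wpath);
    free(wmode);
    return fp;
}
#endif
```

Note that a wrapper like fopen_utf8() changes the meaning of the path argument relative to fopen(), which is exactly the compatibility concern raised above; that is the argument for a separate entry point rather than a change to pcap_open_offline().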

> My library gets its filename from a java string and it currently  
> converts it to plain UTF-8 charset and that works fine.

On UN*X, it should perhaps be converted to whatever the locale's  
filename character set is.

I'm not sure how that would be determined, however.  I might be  
tempted to assume that, if the environment variable LC_CTYPE is set,  
that specifies the encoding; otherwise, if LANG is set, that specifies  
the encoding; otherwise it might be the C locale (which, I think,  
unfortunately says the encoding is ASCII).  However, GLib (not glibc,  
GLib) has its own additional environment variables:

	http://library.gnome.org/devel/glib/stable/glib-running.html

and I'm not sure why that's the case.
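
On the question of determining the locale's encoding: rather than parsing LC_CTYPE and LANG by hand, one option is to ask the C library with POSIX nl_langinfo(CODESET) after calling setlocale(). A minimal sketch follows (the function name is hypothetical). It reports the encoding of the LC_CTYPE locale; that this is also the encoding used for file names is an assumption, not something the system guarantees. POSIX additionally lets LC_ALL override both LC_CTYPE and LANG.

```c
#include <langinfo.h>
#include <locale.h>

/*
 * Hypothetical helper: returns the name of the character encoding of
 * the LC_CTYPE locale selected by the environment, e.g. "UTF-8".
 * Side effect: changes the process's LC_CTYPE locale.
 */
const char *locale_codeset(void)
{
    /*
     * "" means "take the locale from the environment" (LC_ALL, then
     * LC_CTYPE, then LANG). If the environment names a locale that is
     * not installed, setlocale() fails and the current locale, "C" at
     * program startup, stays in effect.
     */
    (void)setlocale(LC_CTYPE, "");
    return nl_langinfo(CODESET);
}
```

The returned name is implementation-defined; the spelling of the same encoding can differ between systems, so comparing it against a fixed string such as "UTF-8" needs some care.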

> But in reality I'd like to support all unicode widths 8, 16 and even  
> 32 bit. I'm not sure how those wider unicode chars would be handled.

How are they handled elsewhere in Java?  The File class seems to work  
with Strings, and the String class, at least as I understand the  
documentation, uses UTF-16 (presumably that's what you mean by  
"unicode [width] ... 16 ... bit").

