By
Sahir Shah
24 - June - 2007
The 7 bit ASCII character set covers the basic Latin alphabet, and its 8 bit extensions (such as the Windows ANSI code pages) cover the accented letters used by most languages of Europe. An encoding in which every character occupies one byte is known as an SBCS (Single Byte Character Set). Some languages use scripts which require many more characters than a single byte can hold. For example, the Dai Kan-Wa Jiten, a Japanese dictionary of kanji, contains about 50,000 character entries. Multi Byte Character Sets (MBCS) and Unicode were developed to allow computer systems to represent text in any of the world's languages. MBCS is a variable width scheme in which a character is either 1 byte or 2 bytes wide: the value of the lead byte indicates whether the next byte is part of a double byte character. Unicode is a character set that covers the world's scripts. Windows stores it as UTF-16, in which most characters, including all Latin ones, occupy a single 16 bit (2 byte) unit and the remaining characters occupy two such units.
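To make the lead byte rule concrete, here is a minimal sketch that counts characters, rather than bytes, in a double byte string. It assumes the Shift-JIS style lead byte ranges of code page 932 (0x81 to 0x9F and 0xE0 to 0xFC). The names isLeadByteSketch and countMbcsChars are made up for this illustration; on Windows the system function IsDBCSLeadByte would normally answer the lead byte question for the active code page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical lead byte test for a Shift-JIS style double byte encoding.
// Assumption: lead bytes fall in 0x81-0x9F or 0xE0-0xFC (code page 932).
bool isLeadByteSketch(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Counts characters (not bytes) in an MBCS encoded byte string.
std::size_t countMbcsChars(const std::string& bytes) {
    std::size_t chars = 0;
    for (std::size_t i = 0; i < bytes.size(); ++chars) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        // A lead byte consumes the following trail byte as well;
        // a dangling lead byte at the very end is counted on its own.
        i += (isLeadByteSketch(b) && i + 1 < bytes.size()) ? 2 : 1;
    }
    return chars;
}
```

With these assumptions, a four byte string made of two double byte characters is reported as two characters, while plain ASCII text has as many characters as bytes.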
The basic types for 8 bit and 16 bit characters are char and wchar_t; all other types are synonyms created by typedef declarations. Some other types, such as _bstr_t and CString, are encapsulations of string types. Some of these types are single byte or double byte depending on whether UNICODE is defined. For example, TCHAR is defined in WinNT.h as
#ifdef UNICODE
typedef WCHAR TCHAR;
#else
typedef char TCHAR;
#endif
and WCHAR is in turn a synonym for wchar_t. The character types which are always single byte are CHAR, UCHAR, PCHAR and PUCHAR; the last two are pointers to an 8 bit character. TCHAR can be either a single byte char or a wide char, and WCHAR is always a wchar_t. And then there is OLECHAR, which is a WCHAR unless OLE2ANSI is defined.
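The effect of the UNICODE switch can be imitated without the Windows headers. The sketch below uses stand-in names (SKETCH_UNICODE, SketchTChar, SKETCH_TEXT, sketchLength, greeting) that are invented for this illustration; they mirror the way TCHAR and the TEXT macro let one line of source compile to either a narrow or a wide string.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for TCHAR and TEXT(), so that the sketch builds
// without <windows.h>. Define SKETCH_UNICODE to get the wide variant.
#ifdef SKETCH_UNICODE
typedef wchar_t SketchTChar;
#define SKETCH_TEXT(s) L##s   // pastes the L prefix onto the literal
inline std::size_t sketchLength(const SketchTChar* s) { return std::wcslen(s); }
#else
typedef char SketchTChar;
#define SKETCH_TEXT(s) s      // leaves the literal narrow
inline std::size_t sketchLength(const SketchTChar* s) { return std::strlen(s); }
#endif

// The same source line yields a char or a wchar_t string.
inline const SketchTChar* greeting() { return SKETCH_TEXT("hello world"); }
```

Code written against the stand-in type never names char or wchar_t directly, which is the same discipline that TCHAR based Windows code follows.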
The string types are
always 8 bit - LPCSTR, LPSTR, PCSTR, PSTR
8 bit or 16 bit, depending on UNICODE - LPCTSTR, LPTSTR, PCTSTR, PTSTR
always 16 bit - LPCWSTR, LPWSTR, PCWSTR, PWSTR , BSTR
Conversion between 1 and 2 byte character encoding schemes is meaningful only for scripts, such as Latin, which can be represented in a single byte encoding scheme. In the case of Latin script, MBCS to SBCS conversion is a non-issue. From what I have seen of posts in C++ forums, what interests many people is the conversion of Latin text to a char array in a Unicode enabled application. There is the W2A macro and the WideCharToMultiByte function (MultiByteToWideChar converts in the opposite direction), but I found that they did not work very well for me in VS 2005 SP1. Failing that, we can write our own conversion routine.
What we can use to our advantage is the way Latin characters in a Unicode string are stored in memory. Every 16 bit character whose value is below 256 has a zero high order byte, so only the least significant byte carries a value. A narrowing conversion to a single byte type discards all bits except those in the least significant byte, and since the high order byte contains no value anyway, the conversion loses nothing. This holds only for such characters: the result is plain ASCII for values below 128 and Latin-1 for values from 128 to 255, and any character above 255 is silently corrupted.
#include <windows.h>
#include <cwchar>
#include <iostream>
using namespace std;
int main(){
    LPCWSTR wstr = L"hello world";
    size_t count = wcslen(wstr);
    char* c = new char[count + 1];
    // Copy each 16 bit character, keeping only its low order byte.
    for(size_t j = 0; j < count; ++j){
        c[j] = static_cast<char> (wstr[j]);
    }
    c[count] = '\0';
    cout<<c<<endl;
    delete[] c;
    return 0;
}
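The loop above copies the low order byte without checking what it throws away. As a hedged variation, the sketch below reports whether the narrowing was lossless. It assumes that only values below 0x80 (plain ASCII) are safe to copy and substitutes '?' for anything else; the function name narrowAsciiSketch and the placeholder character are choices made for this illustration, not part of any framework. It uses std::wstring and std::string so that it builds without the Windows headers and needs no manual delete[].

```cpp
#include <cstddef>
#include <string>

// Narrows a wide string to 8 bit chars, one code unit at a time.
// Assumption: only values below 0x80 (plain ASCII) are copied as they are;
// anything else is replaced by the placeholder '?'.
// Returns true only if no character had to be replaced.
bool narrowAsciiSketch(const std::wstring& wide, std::string& narrow) {
    bool lossless = true;
    narrow.clear();
    narrow.reserve(wide.size());
    for (std::size_t i = 0; i < wide.size(); ++i) {
        const unsigned long unit = static_cast<unsigned long>(wide[i]);
        if (unit < 0x80) {
            narrow.push_back(static_cast<char>(unit));
        } else {
            narrow.push_back('?');   // value does not fit in 7 bits
            lossless = false;
        }
    }
    return lossless;
}
```

A caller can pass an LPCWSTR directly, since std::wstring can be constructed from a null terminated wide string, and can then decide what to do when the function returns false.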
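When UNICODE is defined, each character in a CString object is a wchar_t. A CString can be converted to an 8 bit char array one character at a time with the CString's GetAt function. The Mid function also works, but it builds a temporary one-character CString on every iteration.
#include <atlstr.h>
#include <iostream>
using namespace std;
int main(){
    CString cs(L"hello world");
    int count = cs.GetLength();
    char* c = new char[count + 1];
    // GetAt returns the character at index j; keep its low order byte.
    for(int j = 0; j < count; j++){
        c[j] = static_cast<char> (cs.GetAt(j));
    }
    c[count] = '\0';
    cout<<c<<endl;
    delete[] c;
    return 0;
}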
A _bstr_t can be converted to a char array by calling its GetBSTR() function to get the underlying BSTR and then using the same method that was used for converting the LPCWSTR.
#include <comutil.h>
#include <iostream>
#pragma comment(lib, "comsupp.lib")
using namespace std;
int main(){
    _bstr_t bstr = L"hello world";
    unsigned int count = bstr.length();
    char* c = new char[count + 1];
    // GetBSTR() exposes the wide character buffer owned by bstr.
    BSTR bstr2 = bstr.GetBSTR();
    for(unsigned int j = 0; j < count; ++j) {
        c[j] = static_cast<char> (bstr2[j]);
    }
    c[count] = '\0';
    cout<<c<<endl;
    delete[] c;
    return 0;
}