overriding ctype<wchar_t>

Posted by Potatoswatter on Stack Overflow
Published on 2010-02-26T05:00:28Z Indexed on 2010/03/25 0:23 UTC

I'm writing a lambda calculus interpreter for fun and practice. I got iostreams to properly tokenize identifiers by adding a ctype facet which defines punctuation as whitespace:

struct token_ctype : ctype<char> {
 mask t[ table_size ];
 token_ctype()
 : ctype<char>( t ) {
  for ( size_t tx = 0; tx < table_size; ++ tx ) {
   t[tx] = isalnum( tx )? alnum : space;
  }
 }
};

(classic_table() would probably be cleaner but that doesn't work on OS X!)
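For comparison, here is a minimal sketch of what the classic_table() variant mentioned above might look like. The name classic_token_ctype is hypothetical, and the sketch assumes the library returns a usable table from classic_table() (the post reports that this was not the case on OS X at the time).

```cpp
#include <cassert>
#include <cstddef>
#include <locale>

// Hypothetical variant of the facet above: start from the classic "C"
// table and add the space bit to every punctuation character.
struct classic_token_ctype : std::ctype<char> {
    mask t[table_size];

    classic_token_ctype() : std::ctype<char>(t) {
        const mask* classic = classic_table();
        // Assumption: the library hands back a usable table here.
        assert(classic != 0 && "classic_table() returned no table");
        for (std::size_t i = 0; i < table_size; ++i) {
            t[i] = classic[i];                      // keep the original bits
            if (classic[i] & punct) t[i] |= space;  // punctuation also splits tokens
        }
    }
};
```

Unlike the alnum-or-space table, this keeps every character's original classification (so '(' is still punct) and only adds the space bit on top.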

And then swap the facet in when I hit an identifier:

locale token_loc( in.getloc(), new token_ctype );
…
locale const &oldloc = in.imbue( token_loc );
in.unget() >> token;
in.imbue( oldloc );
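Put together, the swap can be sketched as a self-contained helper. This is only an illustration: read_identifier is a hypothetical name, the elided part of the snippet above is not reconstructed, and the unget() step is left out because it depends on how the caller peeked at the stream.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <istream>
#include <locale>
#include <sstream>
#include <string>

// The facet from the post: alphanumerics are token characters,
// everything else is treated as whitespace.
struct token_ctype : std::ctype<char> {
    mask t[table_size];
    token_ctype() : std::ctype<char>(t) {
        for (std::size_t tx = 0; tx < table_size; ++tx)
            t[tx] = std::isalnum(static_cast<int>(tx)) ? alnum : space;
    }
};

// Hypothetical helper: read one identifier with the token facet imbued,
// then restore whatever locale the stream had before.
std::string read_identifier(std::istream& in) {
    std::locale token_loc(in.getloc(), new token_ctype);  // locale owns the facet
    std::locale old_loc = in.imbue(token_loc);            // imbue returns the previous locale
    std::string token;
    in >> token;                                          // stops at the first non-alphanumeric
    in.imbue(old_loc);
    assert(in.getloc() == old_loc);                       // stream is back to its old locale
    return token;
}
```

The delimiter that ends a token is left in the stream, so the caller can still inspect it under the original locale.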

There seems to be surprisingly little lambda calculus code on the Web. Most of what I've found so far is full of Unicode λ characters, so I thought I'd try adding Unicode support.

But ctype<wchar_t> works completely differently from ctype<char>. There is no master table; instead there are four virtual methods to override: two overloads of do_is, plus do_scan_is and do_scan_not. So I did this:

struct token_ctype : ctype< wchar_t > {
 typedef ctype<wchar_t> base;

 // Single character: punctuation and λ also count as whitespace.
 bool do_is( mask m, char_type c ) const {
  return base::do_is(m,c)
  || ( (m&space) && ( base::do_is(punct,c) || c == L'λ' ) );
 }

 // Range form: add the space bit wherever punct or λ is found.
 const char_type* do_is
  (const char_type* lo, const char_type* hi, mask* vec) const {
  base::do_is(lo,hi,vec);
  for ( mask *vp = vec; lo != hi; ++ vp, ++ lo ) {
   if ( (*vp & punct) || *lo == L'λ' ) *vp |= space;
  }
  return hi;
 }

 // First character matching m, where space also matches punct and λ.
 const char_type *do_scan_is
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) m |= punct;
  hi = base::do_scan_is(m,lo,hi); // delegate to the base, not to this override
  if ( m & space ) hi = find( lo, hi, L'λ' );
  return hi;
 }

 // First character not matching m, skipping λ when scanning over space.
 const char_type *do_scan_not
  (mask m, const char_type* lo, const char_type* hi) const {
  if ( m & space ) {
   m |= punct;
   // compare against hi before dereferencing lo
   while ( ( lo = base::do_scan_not(m,lo,hi) ) != hi && *lo == L'λ' )
    ++ lo;
   return lo;
  }
  return base::do_scan_not(m,lo,hi);
 }
};

(Apologies for the flat formatting; the preview converted the tabs differently.)

The code is WAY less elegant. It does better express the notion that only punctuation is additional whitespace, but that would've been fine in the original had I had classic_table.

Is there a simpler way to do this? Do I really need all those overloads? (Testing showed do_scan_not is extraneous here, but I'm thinking more broadly.) Am I abusing facets in the first place? Is the above even correct? Would it be better style to implement less logic?
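On the "implement less logic" question, one possible shape is to route every decision through the single-character do_is, so the scanning overloads cannot disagree with it. The sketch below is an assumption-laden illustration rather than a recommended answer: simple_token_ctype, kLambda, and check_space_consistency are hypothetical names, the code assumes C++11 (for override), and it trades the base class's possibly faster range scans for simplicity.

```cpp
#include <cassert>
#include <cstddef>
#include <locale>
#include <string>
#include <vector>

const wchar_t kLambda = L'\u03BB';  // Greek small letter lambda

// Hypothetical simplified facet: punctuation and lambda are extra whitespace,
// and both scan functions are written in terms of the single-character do_is.
struct simple_token_ctype : std::ctype<wchar_t> {
    typedef std::ctype<wchar_t> base;

    bool do_is(mask m, char_type c) const override {
        return base::do_is(m, c)
            || ((m & space) && (base::do_is(punct, c) || c == kLambda));
    }

    const char_type* do_is(const char_type* lo, const char_type* hi,
                           mask* vec) const override {
        base::do_is(lo, hi, vec);  // fill in the base classification first
        for (; lo != hi; ++lo, ++vec)
            if ((*vec & punct) || *lo == kLambda) *vec |= space;
        return hi;
    }

    const char_type* do_scan_is(mask m, const char_type* lo,
                                const char_type* hi) const override {
        while (lo != hi && !do_is(m, *lo)) ++lo;
        return lo;
    }

    const char_type* do_scan_not(mask m, const char_type* lo,
                                 const char_type* hi) const override {
        while (lo != hi && do_is(m, *lo)) ++lo;
        return lo;
    }
};

// Checks, for the space mask only, that the range functions agree with
// the single-character is() at every position of s.
void check_space_consistency(const std::ctype<wchar_t>& ct,
                             const std::wstring& s) {
    const wchar_t* lo = s.data();
    const wchar_t* hi = lo + s.size();
    std::vector<std::ctype_base::mask> vec(s.size());
    if (!s.empty()) ct.is(lo, hi, &vec[0]);
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool sp = ct.is(std::ctype_base::space, s[i]);
        assert(((vec[i] & std::ctype_base::space) != 0) == sp);
        // A scan starting at i stops at i exactly when the
        // single-character test says it should.
        assert((ct.scan_is(std::ctype_base::space, lo + i, hi) == lo + i) == sp);
        assert((ct.scan_not(std::ctype_base::space, lo + i, hi) == lo + i) == !sp);
    }
}
```

The checker only uses the public ctype interface, so the same function could be pointed at the four-override facet above to probe the "is this even correct?" question on sample strings.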
