scanner generating kit


StrScanner user guide

Purpose of this extension

StrScanner is a Ruby extension for fast string scanning.

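Ruby's Regexp class cannot anchor a match at an arbitrary position inside a string, so to scan a string you have to keep creating new String objects. For example:

p " I_want_to_match_this_word but can't".index( /\A\w+/, 1 )

This code displays "nil", because \A matches only at the very beginning of the string, not at index 1. Another way to match is like this: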

str = " word word word"
while str.size > 0 do
  if /\A[ \t]+/ === str then
    str = $'
  elsif /\A\w+/ === str then
    str = $'
  end
end

But this method has a big speed problem: $' makes a new string EVERY time. In this example, all of these strings are created:
" word word word"
"word word word"
" word word"
"word word"
" word"
"word"
""

This is a heavy load. If 'str' is 50KB long, roughly 50KB ** 2 / 5 = 500MB of strings is allocated in total over the whole scan!!
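
The n ** 2 / 5 estimate can be checked on a small input. The following is only a minimal sketch and not part of StrScanner: the helper name 'post_match_bytes' is made up for this illustration. It adds up the sizes of all the strings that $' produces while the loop above consumes its input.

```ruby
# Hypothetical helper (not part of StrScanner): total size in bytes of all
# the strings that $' creates while the loop consumes 'str'.
def post_match_bytes(str)
  total = 0
  until str.empty?
    # Try the two token patterns in turn; give up if neither matches.
    break unless /\A[ \t]+/ =~ str || /\A\w+/ =~ str
    str = $'                 # a brand new String on every iteration
    total += str.bytesize
  end
  total
end

input = " word word word"
puts post_match_bytes(input)     # 14 + 10 + 9 + 5 + 4 + 0 = 42
puts input.bytesize ** 2 / 5     # the n ** 2 / 5 estimate: 225 / 5 = 45
```

For this 15-byte input the measured total (42) and the estimate (45) are close; the total grows with the square of the input length.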

The StrScanner class solves this.
A StrScanner holds a C string and a pointer into it. When scanning, it only advances the pointer and does not create a new string. As a result, scanning gets faster and the application uses less memory.

Simple examples and methods

Here are two short examples of a scanning routine.
The first is easy to write but slow. The second is just as easy to write, but it is FAST because it uses the StrScanner class.

First example:

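ATOM  = /\A\w+/
SPACE = /\A[ \t]+/

str    = " word word word"
tokens = []
while str.size > 0 do
  if ATOM === str then
    tokens.push $&    # keep the matched token
    str = $'          # allocates a new String every time
  elsif SPACE === str then
    tokens.push $&
    str = $'
  else
    break             # nothing matches; stop instead of looping forever
  end
end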

Second example:

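ATOM  = /\A\w+/
SPACE = /\A[ \t]+/

str    = " word word word"
tokens = []
s = StrScanner.new( str )
while s.rest? do
  if temp = s.scan( ATOM ) then
    tokens.push temp  # keep the matched token; no new 'rest' string is made
  elsif temp = s.scan( SPACE ) then
    tokens.push temp
  else
    break             # nothing matches; stop instead of looping forever
  end
end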
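
StrScanner itself may not be installed on a current Ruby. As a runnable stand-in, the sketch below uses StringScanner from Ruby's standard 'strscan' library, on the assumption that its 'scan' behaves like the one described here for patterns anchored with \A. The method name 'tokenize' is invented for this example, and 'eos?' is used as the loop condition in place of 'rest?'.

```ruby
require 'strscan'

# Assumption: StringScanner (stdlib) stands in for StrScanner here.
# 'tokenize' is a made-up name for this illustration.
def tokenize(str)
  atom  = /\A\w+/
  space = /\A[ \t]+/
  tokens = []
  s = StringScanner.new(str)
  until s.eos?                      # loop while un-scanned text remains
    if (tok = s.scan(atom))
      tokens << tok
    elsif (tok = s.scan(space))
      tokens << tok
    else
      break                         # nothing matches; stop scanning
    end
  end
  tokens
end

p tokenize(" word word word")
```

The scanner only moves its internal position; the only strings created are the tokens themselves.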

Using StrScanner is simple.
First create a StrScanner object, then call its 'scan' method. 'scan' returns the matched string and at the same time advances the internally maintained "scan pointer", which is implemented simply as a pointer to char (char*).
The 'skip' method is similar to 'scan', but it returns the length of the matched string.

s = StrScanner.new( "abcdefg" )   # scan pointer is on 'a', index 0
puts s.scan( /\Aa/ )              # return 'a'. scan pointer is on 'b', index 1
puts s.skip( /\Abc/ )             # return 2. scan pointer is on 'd', index 3
(continued below)

At that time the previous "scan pointer" is preserved in the StrScanner object. So str[ prev pointer...current pointer ] is the string that 'scan' returned, the "matched string". We can get it with the 'matched' method.

puts s.matched                    # return 'bc'. scan pointer does not move
puts s.scan( /\Aa/ )              # return nil. scan pointer does not move, either.
puts s.matched                    # return 'bc'.

Moving the scan pointer back is also permitted; the 'unscan' method does that. But 'unscan' can be done only ONCE per 'scan', because a StrScanner object cannot preserve more than one previous pointer.

puts s.scan( /\Ade/ )             # return 'de'. scan pointer is on 'f', index 5
s.unscan                          # scan pointer is on 'd', index 3
puts s.scan( /\Adef/ )            # return 'def'. scan pointer is on 'g', index 6

Yes, all these regexps begin with "\A". This is important. If the regexp matches at a non-zero offset from the scan pointer, 'scan' (and the other methods) return the string from the SCAN POINTER to the end of the match. For example:

str = StrScanner.new( 'aaaabbbbcccc' ).scan( /bbbb/ )
p str    # will print "aaaabbbb"
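
The behaviour described above can be imitated with plain Regexp#match. This is only a sketch of the described result, not StrScanner's own code; the helper name 'prefix_through_match' is hypothetical, and the scan pointer is assumed to be at index 0 as in the example.

```ruby
# Hypothetical helper: returns everything from the start of 'str' up to
# the end of the first match of 'regex', or nil if there is no match.
def prefix_through_match(str, regex)
  m = regex.match(str)
  return nil unless m
  str[0...m.end(0)]        # m.end(0) is the index just past the match
end

p prefix_through_match('aaaabbbbcccc', /bbbb/)    # "aaaabbbb"
p prefix_through_match('aaaabbbbcccc', /\Abbbb/)  # nil, \A pins the match to the start
```

Anchoring with \A is what keeps a scan from silently swallowing the text in front of the match.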

For more details, see the reference manual below (and/or experiment). And of course, the source code is the most important documentation, I think :-)


StrScanner reference manual

Class Methods

new( str : String, dup_p = true ) : StrScanner
Creates a new StrScanner object. 'str' is the string to scan; 'dup_p' is a flag that says whether to duplicate the string. 'dup_p' can normally be left at its default.

Methods

scan( regex : Regexp ) : String
Tries to match 'regex'.
If it matches, moves the "scan pointer" forward and returns the matched string; otherwise returns nil.
skip( regex : Regexp ) : Integer
Tries to match 'regex'.
If it matches, moves the "scan pointer" forward and returns the length of the matched string; otherwise returns nil.
match?( regex : Regexp ) : Boolean
Tries to match 'regex'.
If it matches, leaves the "scan pointer" untouched and returns the length of the matched string; otherwise returns nil.
fullscan( regex : Regexp, makestr_p : Boolean, fwdptr_p : Boolean ) : Object
Tries to match 'regex'.
If it matches: moves the "scan pointer" forward when 'fwdptr_p' is true and leaves it untouched otherwise, then returns the matched string when 'makestr_p' is true and the length of the matched string otherwise.
If it does not match, returns nil.
getch : String
Returns the 1 byte that the "scan pointer" points to and moves the pointer forward.
rest : String
Returns the un-scanned part of the string, starting at the byte that the "scan pointer" points to.
rest? : Boolean
Returns true if any un-scanned string remains.
restsize : Integer
Returns the length of the 'rest' string.
unscan
Sets the "scan pointer" back once. Calling 'unscan' more than once for one 'scan' (or 'skip', and so on) raises ScanError.
matched
Returns the string matched by the previous 'scan' or 'skip'. 'match?' does not count, because it does not move the "scan pointer" forward.
matchedsize
Returns the length of the 'matched' string.
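
The entries for 'scan', 'skip' and 'match?' read like special cases of 'fullscan' with fixed flags. The sketch below models that reading in plain Ruby. It is a hypothetical model and not the real implementation: every '*_model' name is invented here, the scan pointer is a plain Integer, and each call returns the result together with the new pointer.

```ruby
# Hypothetical model of the flag logic described for 'fullscan'.
# Returns [result, new_pointer].
def fullscan_model(str, pos, regex, makestr_p, fwdptr_p)
  rest = str[pos..-1]              # model only; StrScanner is described as not copying
  m = regex.match(rest)
  return [nil, pos] unless m       # no match: nil, pointer untouched
  len = m.end(0)                   # length from the pointer to the match end
  new_pos = fwdptr_p ? pos + len : pos
  [makestr_p ? rest[0, len] : len, new_pos]
end

# The three simpler methods, expressed through the two flags.
def scan_model(str, pos, regex)
  fullscan_model(str, pos, regex, true, true)    # string, pointer moves
end

def skip_model(str, pos, regex)
  fullscan_model(str, pos, regex, false, true)   # length, pointer moves
end

def match_model(str, pos, regex)
  fullscan_model(str, pos, regex, false, false)  # length, pointer stays
end

p scan_model("abcdefg", 0, /\Aa/)     # ["a", 1]
p skip_model("abcdefg", 1, /\Abc/)    # [2, 3]
p match_model("abcdefg", 3, /\Ade/)   # [2, 3]
p scan_model("abcdefg", 3, /\Aa/)     # [nil, 3]
```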

Copyright(c) 1998-1999 Minero Aoki