author | Jordan K. Hubbard <jkh@FreeBSD.org> | 1993-06-18 04:22:21 +0000 |
---|---|---|
committer | Jordan K. Hubbard <jkh@FreeBSD.org> | 1993-06-18 04:22:21 +0000 |
commit | b76095a4307cc94ec7cd722853f9b032e45e6ea4 | |
tree | 890f91d43eec35dc2f71a54410491f6503ca5b38 /gnu/usr.bin/awk | |
parent | 7c434002a4e47486e9a2d7b2f32b1ddf42d37e2a | |
Updated GNU utilities
Notes:
svn path=/cvs2svn/branches/unlabeled-1.1.1/; revision=9
Diffstat (limited to 'gnu/usr.bin/awk')
35 files changed, 31389 insertions, 0 deletions
diff --git a/gnu/usr.bin/awk/ACKNOWLEDGMENT b/gnu/usr.bin/awk/ACKNOWLEDGMENT new file mode 100644 index 000000000000..b6c3b0b0c692 --- /dev/null +++ b/gnu/usr.bin/awk/ACKNOWLEDGMENT @@ -0,0 +1,21 @@ +The current developers of Gawk would like to thank and acknowledge the +many people who have contributed to the development through bug reports +and fixes and suggestions. Unfortunately, we have not been organized +enough to keep track of all the names -- for that we apologize. + +Another group of people have assisted even more by porting Gawk to new +platforms and providing a great deal of feedback. They are: + + Hal Peterson <hrp@pecan.cray.com> (Cray) + Pat Rankin <gawk.rankin@EQL.Caltech.Edu> (VMS) + Michal Jaegermann <NTOMCZAK@vm.ucs.UAlberta.CA> (Atari, NeXT, DEC 3100) + Mike Lijewski <mjlx@eagle.cnsf.cornell.edu> (IBM RS6000) + Scott Deifik <scottd@amgen.com> (MSDOS 2.14) + Kent Williams (MSDOS 2.11) + Conrad Kwok (MSDOS earlier versions) + Scott Garfinkle (MSDOS earlier versions) + +Last, but far from least, we would like to thank Brian Kernighan who +has helped to clear up many dark corners of the language and provided a +restraining touch when we have been overly tempted by "feeping +creaturism". diff --git a/gnu/usr.bin/awk/COPYING b/gnu/usr.bin/awk/COPYING new file mode 100644 index 000000000000..3358a7be862a --- /dev/null +++ b/gnu/usr.bin/awk/COPYING @@ -0,0 +1,340 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc. + 675 Mass Ave, Cambridge, MA 02139, USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. 
This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. 
We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. 
+ +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. 
But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) 
+ +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. 
Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. 
Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. 
Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + Appendix: How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. 
It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) 19yy <name of author> + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) 19yy name of author + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. 
Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + <signature of Ty Coon>, 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. + diff --git a/gnu/usr.bin/awk/FUTURES b/gnu/usr.bin/awk/FUTURES new file mode 100644 index 000000000000..b09656046b27 --- /dev/null +++ b/gnu/usr.bin/awk/FUTURES @@ -0,0 +1,120 @@ +This file lists future projects and enhancements for gawk. Items are listed +in roughly the order they will be done for a given release. This file is +mainly for use by the developers to help keep themselves on track, please +don't bug us too much about schedules or what all this really means. + +For 2.16 +======== +David: + Move to autoconf-based configure system. + + Allow RS to be a regexp. + + RT variable to hold text of record terminator + + RECLEN variable for fixed length records + + Feedback alloca.s changes to FSF + + Extensible hashing in memory of awk arrays + + Split() with null string as third arg to split up strings + + Analogously, setting FS="" would split the input record into individual + characters. + +Arnold: + Generalize IGNORECASE + any value makes it work, not just numeric non-zero + make it apply to *all* string comparisons + + Fix FILENAME to have an initial value of "", not "-" + + Clean up code by isolating system-specific functions in separate files. + + Undertake significant directory reorganization. + + Extensive manual cleanup: + Use of texinfo 2.0 features + Lots more examples + Document all of the above. 
+ +In 2.17 +======= +David: + + Incorporate newer dfa.c and regex.c (go to POSIX regexps) + + Make regex + dfa less dependant on gawk header file includes + + General sub functions: + edit(line, pat, sub) and gedit(line, pat, sub) + that return the substituted strings and allow \1 etc. in the sub string. + +Arnold: + DBM storage of awk arrays. Try to allow multiple dbm packages + + ? Have strftime() pay attention to the value of ENVIRON["TZ"] + + Additional manual features: + Document posix regexps + Document use of dbm arrays + ? Add an error messages section to the manual + ? A section on where gawk is bounded + regex + i/o + sun fp conversions + +For 2.18 +======== + +Arnold: + Add chdir and stat built-in functions. + + Add function pointers as valid variable types. + + Add an `ftw' built-in function that takes a function pointer. + +David: + + Do an optimization pass over parse tree? + +For 2.19 or later: +================== +Add variables similar to C's __FILE__ and __LINE__ for better diagnostics +from within awk programs. + +Add an explicit concatenation operator and assignment version. + +? Add a switch statement + +Add the ability to seek on an open file and retrieve the current file position. + +Add lint checking everywhere, including check for use of builtin vars. +only in new awk. + +"restart" keyword + +Add |& + +Make awk '/foo/' files... run at egrep speeds + +Do a reference card + +Allow OFMT to be other than a floating point format. + +Allow redefining of builtin functions? + +Make it faster and smaller. + +For 3.x: +======== + +Create a gawk compiler? + +Create a gawk-to-C translator? (or C++??) + +Provide awk profiling and debugging. + + + diff --git a/gnu/usr.bin/awk/LIMITATIONS b/gnu/usr.bin/awk/LIMITATIONS new file mode 100644 index 000000000000..5877197aeb55 --- /dev/null +++ b/gnu/usr.bin/awk/LIMITATIONS @@ -0,0 +1,14 @@ +This file describes limits of gawk on a Unix system (although it +is variable even then). 
Non-Unix systems may have other limits. + +# of fields in a record: MAX_INT +Length of input record: MAX_INT +Length of output record: unlimited +Size of a field: MAX_INT +Size of a printf string: MAX_INT +Size of a literal string: MAX_INT +Characters in a character class: 2^(# of bits per byte) +# of file redirections: unlimited +# of pipe redirections: min(# of processes per user, # of open files) +double-precision floating point +Length of source line: unlimited diff --git a/gnu/usr.bin/awk/Makefile b/gnu/usr.bin/awk/Makefile new file mode 100644 index 000000000000..fdca82c4482e --- /dev/null +++ b/gnu/usr.bin/awk/Makefile @@ -0,0 +1,13 @@ +PROG= awk +SRCS= main.c eval.c builtin.c msg.c iop.c io.c field.c array.c \ + node.c version.c re.c awk.c regex.c dfa.c \ + getopt.c getopt1.c +CFLAGS+= -DGAWK +LDADD= -lm +DPADD= ${LIBM} +CLEANFILES+= awk.c y.tab.h + +MAN1= awk.0 + +.include <bsd.prog.mk> +.include "../../usr.bin/Makefile.inc" diff --git a/gnu/usr.bin/awk/NEWS b/gnu/usr.bin/awk/NEWS new file mode 100644 index 000000000000..6711373d6ea5 --- /dev/null +++ b/gnu/usr.bin/awk/NEWS @@ -0,0 +1,1295 @@ +Changes from 2.15.1 to 2.15.2 +--------------------------- + +Additions to the FUTURES file. + +Document undefined order of output when using both standard output + and /dev/stdout or any of the /dev output files that gawk emulates in + the absence of OS support. + +Clean up the distribution generation in Makefile.in: the info files are + now included, the distributed files are marked read-only and patched + distributions are now unpacked in a directory named with the patch level. + + +Changes from 2.15 to 2.15.1 +--------------------------- + +Close stdout and stderr before all redirections on program exit. This allows + detection of write errors and also fixes the messages test on Solaris 2.x. + +Removed YYMAXDEPTH define in awk.y which was limiting the parser stack depth. 
+ +Changes to config/bsd44, Makefile.bsd44 and configure to bring it into line + with the BSD4.4 release. + +Changed Makefile to use prefix, exec_prefix, bindir etc. + +make install now installs info files. + +make install now sets permissions on installed files. + +Make targets added: uninstall, distclean, mostlyclean and realclean. + +Added config.h to cleaner and clobber make targets. + +Changes to config/{hpux8x,sysv3,sysv4,ultrix41} to deal with alloca(). + +Change to getopt.h for portability. + +Added more special cases to the getpgrp() call. + +Added README.ibmrt-aos and config/ibmrt-aos. + +Changes from 2.14 to 2.15 +--------------------------- + +Command-line source can now be mixed with library functions. + +ARGIND variable tracks index in ARGV of FILENAME. + +GNU style long options in addition to short options. + +Plan 9 style special files interpreted by gawk: + /dev/pid + /dev/ppid + /dev/pgrpid + /dev/user + $1 = getuid + $2 = geteuid + $3 = getgid + $4 = getegid + $5 ... $NF = getgroups if supported + +ERRNO variable contains error string if getline or close fails. + +Very old options -a and -e have gone away. + +Inftest has been removed from the default target in test/Makefile -- the + results were too machine specific and resulted in too many false alarms. + +A README.amiga has been added. + +The "too many arguments supplied for format string" warning message is only + in effect under the lint option. + +Code improvements in dfa.c. + +Fixed all reported bugs: + + Writes are checked for failure (such as full filesystem). + + Stopped (at least some) runaway error messages. + + gsub(/^/, "x") does the right thing for $0 of 0, 1, or more length. + + close() on a command being piped to a getline now works properly. + + The input record will no longer be freed upon an explicit close() + of the input file. + + A NUL character in FS now works. + + In a substitute, \\& now means a literal backslash followed by what + was matched. 
+ + Integer overflow of substring length in substr() is caught. + + An input record without a newline termination is handled properly. + + In io.c, check is against only EMFILE so that system file table + is not filled. + + Renamed all files with names longer than 14 characters. + + Escaped characters in regular expressions were being lost when + IGNORECASE was used. + + Long source lines were not being handled properly. + + Sourcefiles that ended in a tab but no newline were bombing. + + Patterns that could match zero characters in split() were not working + properly. + + The parsedebug option was not working. + + The grammar was being a bit too lenient, allowing some very dubious + programs to pass. + + Compilation with DEBUG defined now works. + + A variable read in with getline was not being treated as a potential + number. + + Array subscripts were not always of string type. + + +Changes from 2.13.2 to 2.14 +--------------------------- + +Updated manual! + +Added "next file" to skip efficiently to the next input file. + +Fixed potential of overflowing buffer in do_sprintf(). + +Plugged small memory leak in sub_common(). + +EOF on a redirect is now "sticky" -- it can only be cleared by close()ing + the pipe or file. + +Now works if used via a #! /bin/gawk line at the top of an executable file + when that line ends with whitespace. + +Added some checks to the grammar to catch redefinition of builtin functions. + This could eventually be the basis for an extension to allow redefining + functions, but in the mean time it's a good error catching facility. + +Negative integer exponents now work. + +Modified do_system() to make sure it had a non-null string to be passed + to system(3). Thus, system("") will flush any pending output but not go + through the overhead of forking an un-needed shell. + +A fix to floating point comparisons so that NaNs compare right on IEEE systems. + +Added code to make sure we're not opening directories for reading and such. 
+ +Added code to do better diagnoses of weird or null file names. + +Allow continue outside of a loop, unless in strict posix mode. Lint option + will issue warning. + +New missing/strftime.c. There has been one chage that affects gawk. Posix + now defines a %V conversion so the vms conversion has been changed to %v. + If this version is used with gawk -Wlint and they use %V in a call to + strftime, they'll get a warning. + +Error messages now conform to GNU standard (I hope). + +Changed comparisons to conform to the description found in the file POSIX. + This is inconsistent with the current POSIX draft, but that is broken. + Hopefully the final POSIX standard will conform to this version. + (Alas, this will have to wait for 1003.2b, which will be a revision to + the 1003.2 standard. That standard has been frozen with the broken + comparison rules.) + +The length of a string was a short and now is a size_t. + +Updated VMS help. + +Added quite a few new tests to the test suite and deleted many due to lack of + written releases. Test output is only removed if it is identical to the + "good" output. + +Fixed a couple of bugs for reference to $0 when $0 is "" -- particularly in + a BEGIN block. + +Fixed premature freeing in construct "$0 = $0". + +Removed the call to wait_any() in gawk_popen(), since on at least some systems, + if gawk's input was from a pipe, the predecssor process in the pipe was a + child of gawk and this caused a deadlock. + +Regexp can (once again) match a newline, if given explicitly. + +nextopen() makes sure file name is null terminated. + +Fixed VMS pipe simulation. Improved VMS I/O performance. + +Catch . used in variable names. + +Fixed bug in getline without redirect from a file -- it was quitting after the + first EOF, rather than trying the next file. + +Fixed bug in treatment of backslash at the end of a string -- it was bombing + rather than doing something sensible. 
It is not clear what this should mean, + but for now I issue a warning and take it as a literal backslash. + +Moved setting of regexp syntax to before the option parsing in main(), to + handle things like -v FS='[.,;]' + +Fixed bug when NF is set by user -- fields_arr must be expanded if necessary + and "new" fields must be initialized. + +Fixed several bugs in [g]sub() for no match found or the match is 0-length. + +Fixed bug where in gsub() a pattern anchorred at the beginning would still + substitute throughout the string. + +make test does not assume the . is in PATH. + +Fixed bug when a field beyond the end of the record was requested after + $0 was altered (directly or indirectly). + +Fixed bug for assignment to field beyond end of record -- the assigned value + was not found on subsequent reference to that field. + +Fixed bug for FS a regexp and it matches at the end of a record. + +Fixed memory leak for an array local to a function. + +Fixed hanging of pipe redirection to getline + +Fixed coredump on access to $0 inside BEGIN block. + +Fixed treatment of RS = "". It now parses the fields correctly and strips + leading whitspace from a record if FS is a space. + +Fixed faking of /dev/stdin. + +Fixed problem with x += x + +Use of scalar as array and vice versa is now detected. + +IGNORECASE now obeyed for FS (even if FS is a single alphabetic character). + +Switch to GPL version 2. + +Renamed awk.tab.c to awktab.c for MSDOS and VMS tar programs. + +Renamed this file (CHANGES) to NEWS. + +Use fmod() instead of modf() and provide FMOD_MISSING #define to undo + this change. + +Correct the volatile declarations in eval.c. + +Avoid errant closing of the file descriptors for stdin, stdout and stderr. + +Be more flexible about where semi-colons can occur in programs. + +Check for write errors on all output, not just on close(). + +Eliminate the need for missing/{strtol.c,vprintf.c}. + +Use GNU getopt and eliminate missing/getopt.c. + +More "lint" checking. 
+ + +Changes from 2.13.1 to 2.13.2 +----------------------------- + +Toward conformity with GNU standards, configure is a link to mkconf, the latter + to disappear in the next major release. + +Update to config/bsd43. + +Added config/apollo, config/msc60, config/cray2-50, config/interactive2.2 + +sgi33.cc added for compilation using cc rather than gcc. + +Ultrix41 now propagates to config.h properly -- as part of a general + mechanism in configure for kludges -- #define anything from a config file + just gets tacked onto the end of config.h -- to be used sparingly. + +Got rid of an unnecessary and troublesome declaration of vprintf(). + +Small improvement in locality of error messages. + +Try to diagnose use of array as scalar and vice versa -- to be improved in + the future. + +Fix for last bug fix for Cray division code--sigh. + +More changes to test suite to explicitly use sh. Also get rid of + a few generated files. + +Fixed off-by-one bug in string concatenation code. + +Fix for use of array that is passed in from a previous function parameter. + Addition to test suite for above. + +A number of changes associated with changing NF and access to fields + beyond the end of the current record. + +Change to missing/memcmp.c to avoid seg. fault on zero length input. + +Updates to test suite (including some inadvertently left out of the last patch) + to invoke sh explicitly (rather than rely on #!/bin/sh) and remove some + junk files. test/chem/good updated to correspond to bug fixes. + +Changes from 2.13.0 to 2.13.1 +----------------------------- + +More configs and PORTS. + +Fixed bug wherein a simple division produced an erroneous FPE, caused by + the Cray division workaround -- that code is now #ifdef'd only for + Cray *and* fixed. + +Fixed bug in modulus implementation -- it was very close to the above + code, so I noticed it.
+ +Fixed portability problem with limits.h in missing.c + +Fixed portability problem with tzname and daylight -- define TZNAME_MISSING + if strftime() is missing and tzname is also. + +Better support for Latin-1 character set. + +Fixed portability problem in test Makefile. + +Updated PROBLEMS file. + +=============================== gawk-2.13 released ========================= +Changes from 2.12.42 to 2.12.43 +------------------------------- + +Typo in awk.y + +Fixed up strftime.3 and added doc. for %V. + +Changes from 2.12.41 to 2.12.42 +------------------------------- + +Fixed bug in devopen() -- if you had write permission in /dev, + it would just create /dev/stdout etc.!! + +Final (?) VMS update. + +Make NeXT use GFMT_WORKAROUND + +Fixed bug in sub_common() for substitute on zero-length match. Improved the + code a bit while I was at it. + +Fixed grammar so that $i++ parses as ($i)++ + +Put support/* back in the distribution (didn't I already do this?!) + +Changes from 2.12.40 to 2.12.41 +------------------------------- + +VMS workaround for broken %g format. + +Changes from 2.12.39 to 2.12.40 +------------------------------- + +Minor man page update. + +Fixed latent bug in redirect(). + +Changes from 2.12.38 to 2.12.39 +------------------------------- + +Updates to test suite -- remove dependence on changing gawk.1 man page. + +Changes from 2.12.37 to 2.12.38 +------------------------------- + +Fixed bug in use of *= without whitespace following. + +VMS update. + +Updates to man page. + +Option handling updates in main.c + +test/manyfiles redone and added to bigtest. + +Fixed latent (on Sun) bug in handling of save_fs. + +Changes from 2.12.36 to 2.12.37 +------------------------------- + +Update REL in Makefile-dist. Incorporate test suite into main distribution. + +Minor fix in regtest. 
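Two of the parsing fixes noted above (the 2.12.42 grammar change so that $i++ parses as ($i)++, and the 2.12.38 fix for *= without following whitespace) are easy to confirm by hand. This is an illustrative sketch only; the AWK variable, the shell variable names and the inputs are assumptions and do not come from the gawk test suite.

```shell
# Hypothetical checks; AWK defaults to whatever "awk" is on PATH.
AWK=${AWK:-awk}

# With i = 1, $i++ should increment field 1 and leave i unchanged.
incr_out=$(echo '5 7' | $AWK '{ i = 1; $i++; print $1, i }')

# "*=" immediately followed by its operand, with no whitespace.
mul_out=$($AWK 'BEGIN { x = 3; x *=2; print x }')

echo "$incr_out"
echo "$mul_out"
```

If $i++ were instead parsed as $(i++), the first check would leave field 1 untouched and change i.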
+ +Changes from 2.12.35 to 2.12.36 +------------------------------- + +Release takes on dual personality -- 2.12.36 and 2.13.0 -- any further + patches before public release won't count for 2.13, although they will for + 2.12 -- be careful to avoid confusion! patchlevel.h will be the last thing + to change. + +Cray updates to deal with arithmetic problems. + +Minor test suite updates. + +Fixed latent bug in parser (freeing memory). + +Changes from 2.12.34 to 2.12.35 +------------------------------- + +VMS updates. + +Flush stdout at top of err() and stderr at bottom. + +Fixed bug in eval_condition() -- it wasn't testing for MAYBE_NUM and + doing the force_number(). + +Included the missing manyfiles.awk and a new test to catch the above bug which + I am amazed wasn't already caught by the test suite -- it's pretty basic. + +Changes from 2.12.33 to 2.12.34 +------------------------------- + +Atari updates -- including bug fix. + +More VMS updates -- also nuke vms/version.com. + +Fixed bug in handling of large numbers of redirections -- it was probably never + tested before (blush!). + +Minor rearrangement of code in r_force_number(). + +Made chem and regtest tests a bit more portable (Ultrix again). + +Added another test -- manyfiles -- not invoked under any other test -- very Unix + specific. + +Rough beginning of LIMITATIONS file -- need my AWK book to complete it. + +Changes from 2.12.32 to 2.12.33 +------------------------------- + +Expunge debug.? from various files. + +Remove vestiges of Floor and Ceil kludge. + +Special case integer division -- mainly for Cray, but maybe someone else + will benefit. + +Workaround for iop_close closing an output pipe descriptor on Cray -- + not conditional since I think it may fix a bug on SGI as well and I don't + think it can hurt elsewhere. + +Fixed memory leak in assoc_lookup(). + +Small cleanup in test suite. 
+ +Changes from 2.12.31 to 2.12.32 +------------------------------- + +Nuked debug.c and debugging flag -- there are better ways. + +Nuked version.sh and version.c in subdirectories. + +Fixed bug in handling of IGNORECASE. + +Fixed bug when FIELDWIDTHS was set via -v option. + +Fixed (obscure) bug when $0 is assigned a numerical value. + +Fixed so that escape sequences in command-line assignments work (as it already + said in the comment). + +Added a few cases to test suite. + +Moved support/* back into distribution. + +VMS updates. + +Changes from 2.12.30 to 2.12.31 +------------------------------- + +Cosmetic manual page changes. + +Updated sunos3 config. + +Small changes in test suite including renaming files over 14 chars. in length. + +Changes from 2.12.29 to 2.12.30 +------------------------------- + +Bug fix for many string concatenations in a row. + +Changes from 2.12.28 to 2.12.29 +------------------------------- + +Minor cleanup in awk.y + +Minor VMS update. + +Minor atari update. + +Changes from 2.12.27 to 2.12.28 +------------------------------- + +Got rid of the debugging goop in eval.c -- there are better ways. + +Sequent port. + +VMS changes left out of the last patch -- sigh! config/vms.h renamed + to config/vms-conf.h. + +Fixed missing/tzset.c + +Removed use of gcvt() and GCVT_MISSING -- turns out it was no faster than + sprintf("%g") and caused all sorts of portability headaches. + +Tuned get_field() -- it was unnecessarily parsing the whole record on reference + to $0. + +Tuned interpret() a bit in the rule_node loop. + +In r_force_number(), worked around bug in Uglix strtod() and got rid of + ugly do{}while(0) at Michal's urging. + +Replaced do_deref() and deref with unref(node) -- much cleaner and a bit faster. + +Got rid of assign_number() -- contrary to comment, it was no faster than + just making a new node and freeing the old one. + +Replaced make_number() and tmp_number() with macros that call mk_number(). 
+ +Changed freenode() and newnode() into macros -- the latter is getnode() + which calls more_nodes() as necessary. + +Changes from 2.12.26 to 2.12.27 +------------------------------- + +Completion of Cray 2 port (includes a kludge for floor() and ceil() + that may go or be changed -- I think that it may just be working around + a bug in chem that is being tweaked on the Cray). + +More VMS updates. + +Moved kludge over yacc's insertion of malloc and realloc declarations + from protos.h to the Makefile. + +Added a lisp interpreter in awk to the test suite. (Invoked under + bigtest.) + +Cleanup in r_force_number() -- I had never gotten around to a thorough + profile of the cache code and it turns out to be not worth it. + +Performance boost -- do lazy force_number()'ing for fields etc. i.e. + flag them (MAYBE_NUM) and call force_number only as necessary. + +Changes from 2.12.25 to 2.12.26 +------------------------------- + +Rework of regexp stuff so that dynamic regexps have reasonable + performance -- string used for compiled regexp is stored and + compared to new string -- if same, no recompilation is necessary. + Also, very dynamic regexps cause dfa-based searching to be turned + off. + +Code in dev_open() is back to returning fileno(std*) rather than + dup()ing it. This will be documented. Sorry for the run-around + on this. + +Minor atari updates. + +Minor vms update. + +Missing file from MSDOS port. + +Added warning (under lint) if third arg. of [g]sub is a constant and + handle it properly in the code (i.e. return how many matches). + +Changes from 2.12.24 to 2.12.25 +------------------------------- + +MSDOS port. + +Non-consequential changes to regexp variables in preparation for + a more serious change to fix a serious performance problem. + +Changes from 2.12.23 to 2.12.24 +------------------------------- + +Fixed bug in output flushing introduced a few patches back. This caused + serious performance losses. 
+ +Changes from 2.12.22 to 2.12.23 +------------------------------- + +Accidentally left config/cray2-60 out of last patch. + +Added some missing dependencies to Makefile. + +Cleaned up mkconf a bit; made yacc the default parser (no alloca needed, + right?); added rs6000 hook for signed characters. + +Made regex.c with NO_ALLOCA undefined work. + +Fixed bug in dfa.c for systems where free(NULL) bombs. + +Deleted a few cant_happen()'s that *really* can't happen. + +Changes from 2.12.21 to 2.12.22 +------------------------------- + +Added to config stuff the ability to choose YACC rather than bison. + +Fixed CHAR_UNSIGNED in config.h-dist. + +Second arg. of strtod() is char ** rather than const char **. + +stackb is now initially malloc()'ed since it may be realloc()'ed. + +VMS updates. + +Added SIZE_T_MISSING to config stuff and a default typedef to awk.h. + (Maybe it is not needed on any current systems??) + +re_compile_pattern()'s size is now size_t unconditionally. + +Changes from 2.12.20 to 2.12.21 +------------------------------- + +Corrected missing/gcvt.c. + +Got rid of use of dup2() and thus DUP_MISSING. + +Updated config/sgi33. + +Turned on (and fixed) in cmp_nodes() the behaviour that I *hope* will be in + POSIX 1003.2 for relational comparisons. + +Small updates to test suite. + +Changes from 2.12.19 to 2.12.20 +------------------------------- + +Sloppy, sloppy, sloppy!! I didn't even try to compile the last two + patches. This one fixes goofs in regex.c. + +Changes from 2.12.18 to 2.12.19 +------------------------------- + +Cleanup of last patch. + +Changes from 2.12.17 to 2.12.18 +------------------------------- + +Makefile renamed to Makefile-dist. + +Added alloca() configuration to mkconf. (A bit kludgey.) Just + add a single line containing ALLOCA_PW, ALLOCA_S or ALLOCA_C + to the appropriate config file to have Makefile-dist edited + accordingly. + +Reorganized output flushing to correspond with new semantics of + devopen() on "/dev/std*" etc.
+ +Fixed rest of last goof!! + +Save and restore errno in do_pathopen(). + +Miscellaneous atari updates. + +Get rid of the trailing comma in the NODETYPE definition (Cray + compiler won't take it). + +Try to make the use of `const' consistent since Cray compiler is + fussy about that. See the changes to `basename' and `myname'. + +It turns out that, according to section 3.8.3 (Macro Replacement) + of the ANSI Standard: ``If there are sequences of preprocessing + tokens within the list of arguments that would otherwise act as + preprocessing directives, the behavior is undefined.'' That means + that you cannot count on the behavior of the declaration of + re_compile_pattern in awk.h, and indeed the Cray compiler chokes on it. + +Replaced alloca with malloc/realloc/free in regex.c. It was much simpler + than expected. (Inside NO_ALLOCA for now -- by default no alloca.) + +Added a configuration file, config/cray60, for Unicos-6.0. + +Changes from 2.12.16 to 2.12.17 +------------------------------- + +Ooops. Goofed signal use in last patch. + +Changes from 2.12.15 to 2.12.16 +------------------------------- + +RENAMED *_dir to just * (e.g. missing_dir). + +Numerous VMS changes. + +Proper inclusion of atari and vms files. + +Added experimental (ifdef'd out) RELAXED_CONTINUATION and DEFAULT_FILETYPE + -- please comment on these! + +Moved pathopen() to io.c (sigh). + +Put local directory ahead in default AWKPATH. + +Added facility in mkconf to echo comments on stdout: lines beginning + with "#echo " will have the remainder of the line echoed when mkconf is run. + Any lines starting with "#" will otherwise be treated as comments. The + intent is to be able to say: + "#echo Make sure you uncomment alloca.c in the Makefile" + or the like. + +Prototype fix for V.4 + +Fixed version_string to not print leading @(#). + +Fixed FIELDWIDTHS to work with strict (turned out to be easy). + +Fixed conf for V.2. + +Changed semantics of /dev/fd/n to be like on real /dev/fd. 
+ +Several configuration and updates in the makefile. + +Updated manpage. + +Include tzset.c and system.c from missing_dir that were accidentally left out of + the last patch. + +Fixed bug in cmdline variable assignment -- arg was getting freed(!) in + call to variable. + +Backed out of parse-time constant folding for now, until I can figure out + how to do it right. + +Fixed devopen() so that getline <"-" works. + +Changes from 2.12.14 to 2.12.15 +------------------------------- + +Changed config/* to a condensed form that can be used with mkconf to generate + a config.h from config.h-dist -- much easier to maintain. Please check + carefully against what you had before for a particular system and report + any problems. vms.h remains separate since the stuff at the bottom + didn't quite fit the mkconf model -- hopefully cleared up later. + +Fixed bug in grammar -- didn't allow function definition to be separated from + other rules by a semi-colon. + +VMS fix to #includes in missing.c -- should we just be including awk.h? + +Updated README for texinfo.tex version. + +Updating of copyright in all .[chy] files. + +Added but commented out Michal's fix to strftime. + +Added tzset() emulation based on Rick Adams' code. Added TZSET_MISSING to + config.h-dist. + +Added strftime.3 man page for missing_dir + +More posix: func, **, **= don't work in -W posix + +More lint: ^, ^= not in old awk + +gawk.1: removed ref to -DNO_DEV_FD, other minor updating. + +Style change: pushbak becomes pushback() in yylex(). + +Changes from 2.12.13 to 2.12.14 +------------------------------- + +Better (?) organization of awk.h -- attempt to keep all system dependencies + near the top and move some of the non-general things out of the config.h + files. + +Change to handling of SYSTEM_MISSING. + +Small change to ultrix config. + +Do "/dev/fd/*" etc. checking at runtime. + +First pass at VMS port. + +Improvements to error handling (when lexeme spans buffers).
+ +Fixed backslash handling -- why didn't I notice this sooner? + +Added programs from book to test suite and new target "bigtest" to Makefile. + +Changes from 2.12.12 to 2.12.13 +------------------------------- + +Recognize OFS and ORS specially so that OFS = 9 works without efficiency hit. + Took advantage of opportunity to tune do_print*() for about 10% win on a + print with 5 args (i.e. small but significant). + +Somewhat pervasive changes to reconcile CONVFMT vs. OFMT. + +Better initialization of builtin vars. + +Make config/* consistent wrt STRTOL_MISSING. + +Small portability improvement to alloca.s + +Improvements to lint code in awk.y + +Replaced strtol() with a better one by Chris Torek. + +Changes from 2.12.11 to 2.12.12 +------------------------------- + +Added PORTS file to record successful ports. + +Added #define const to nothing if not STDC and added const to strtod() header. + +Added * to printf capabilities and partially implemented ' ' and '+' (has an + effect for %d only, silently ignored for other formats). I'm afraid that's + as far as I want to go before I look at a complete replacement for + do_sprintf(). + +Added warning for /regexp/ on LHS of MATCHOP. + +Changes from 2.12.10 to 2.12.11 +------------------------------- + +Small Makefile improvements. + +Some remaining nits from the NeXT port. + +Got rid of bcopy() define in awk.h -- not needed anymore (??) + +Changed private in builtin.c -- it is special on Sequent. + +Added subset implementation of strtol() and STRTOL_MISSING. + +A little bit of cleanup in debug.c, dfa.c. + +Changes from 2.12.9 to 2.12.10 +------------------------------ + +Redid compatibility checking and checking for # of args. + +Removed all references to variables[] from outside awk.y, in preparation + for a more abstract interface to the symbol table. + +Got rid of a remaining use of bcopy() in regex.c.
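The 2.12.13 note about a numeric OFS and the 2.12.12 note about the '+' flag for %d describe behavior that can be observed directly. The sketch below is an assumption-laden illustration (the AWK variable, the shell variable names and the sample data are hypothetical), not code from the distribution.

```shell
# Hypothetical checks; AWK defaults to whatever "awk" is on PATH.
AWK=${AWK:-awk}

# OFS assigned a number: it is used as the string "9" when $0 is rebuilt.
ofs_out=$(echo 'a b' | $AWK 'BEGIN { OFS = 9 } { $1 = $1; print }')

# The '+' flag forces a sign on a %d conversion.
plus_out=$($AWK 'BEGIN { printf "%+d\n", 5 }')

echo "$ofs_out"
echo "$plus_out"
```

The assignment $1 = $1 is there only to force the record to be rebuilt with the new separator.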
+ +Changes from 2.12.8 to 2.12.9 +----------------------------- + +Portability improvements for atari, next and decstation. + +Bug fix in substr() -- wasn't handling 3rd arg. of -1 properly. + +Manpage updates. + +Moved support from src release to doc release. + +Updated FUTURES file. + +Added some "lint" warnings. + +Changes from 2.12.7 to 2.12.8 +----------------------------- + +Changed time() to systime(). + +Changed warning() in snode() to fatal(). + +strftime() now defaults second arg. to current time. + +Changes from 2.12.6 to 2.12.7 +----------------------------- + +Fixed bug in sub_common() involving inadequate allocation of a buffer. + +Added some missing files to the Makefile. + +Changes from 2.12.5 to 2.12.6 +----------------------------- + +Fixed bug wherein non-redirected getline could call iop_close() just + prior to a call from do_input(). + +Fixed bug in handling of /dev/stdout and /dev/stderr. + +Changes from 2.12.4 to 2.12.5 +----------------------------- + +Updated README and support directory. + +Changes from 2.12.3 to 2.12.4 +----------------------------- + +Updated CHANGES and TODO (should have been done in previous 2 patches). + +Changes from 2.12.2 to 2.12.3 +----------------------------- + +Brought regex.c and alloca.s into line with current FSF versions. + +Changes from 2.12.1 to 2.12.2 +----------------------------- + +Portability improvements; mostly moving system prototypes out of awk.h + +Introduction of strftime. + +Use of CONVFMT. + +Changes from 2.12 to 2.12.1 +----------------------------- + +Consolidated treatment of command-line assignments (thus correcting the +-v treatment). + +Rationalized builtin-variable handling into a table-driven process, thus +simplifying variable() and eliminating spc_var(). + +Fixed bug in handling of command-line source that ended in a newline. + +Simplified install() and lookup(). 
+ +Did away with double-mallocing of identifiers and now free second and later +instances of a name, after the first gets installed into the symbol table. + +Treat IGNORECASE specially, simplifying a lot of code, and allowing +checking against strict conformance only on setting it, rather than on each +pattern match. + +Fixed regexp matching when IGNORECASE is non-zero (broken when dfa.c was +added). + +Fixed bug where $0 was not being marked as valid, even after it was rebuilt. +This caused mangling of $0. + + +Changes from 2.11.1 to 2.12 +----------------------------- + +Makefile: + +Portability improvements in Makefile. +Move configuration stuff into config.h + +FSF files: + +Synchronized alloca.[cs] and regex.[ch] with FSF. + +array.c: + +Rationalized hash routines into one with a different algorithm. +delete() now works if the array is a local variable. +Changed interface of assoc_next() and avoided dereferencing past the end of the + array. + +awk.h: + +Merged non-prototype and prototype declarations in awk.h. +Expanded tree_eval #define to short-circuit more calls of r_tree_eval(). + +awk.y: + +Delinted some of the code in the grammar. +Fixed and improved some of the error message printing. +Changed to accommodate unlimited length source lines. +Line continuation now works as advertised. +Source lines can be arbitrarily long. +Refined grammar hacks so that /= assignment works. Regular expressions + starting with /= are recognized at the beginning of a line, after && or || + and after ~ or !~. More contexts can be added if necessary. +Fixed IGNORECASE (multiple scans for backslash). +Condensed expression_lists in array references. +Detect and warn for correct # args in builtin functions -- call most of them + with a fixed number (i.e. fill in defaults at parse-time rather than at + run-time). +Load ENVIRON only if it is referenced (detected at parse-time). +Treat NF, FS, RS, NR, FNR specially at parse time, to improve run time.
+Fold constant expressions at parse time. +Do make_regexp() on third arg. of split() at parse time if it is a constant. + +builtin.c: + +srand() returns 0 the first time called. +Replaced alloca() with malloc() in do_sprintf(). +Fixed setting of RSTART and RLENGTH in do_match(). +Got rid of get_{one,two,three} and allowance for variable # of args. at + run-time -- this is now done at parse-time. +Fixed latent bug in [g]sub whereby changes to $0 would never get made. +Rewrote much of sub_common() for simplicity and performance. +Added ctime() and time() builtin functions (unless -DSTRICT). ctime() returns + a time string like the C function, given the number of seconds since the epoch + and time() returns the current time in seconds. +do_sprintf() now checks for mismatch between format string and number of + arguments supplied. + +dfa.c + +This is borrowed (almost unmodified) from GNU grep to provide faster searches. + +eval.c + +Node_var, Node_var_array and Node_param_list handled from macro rather + than in r_tree_eval(). +Changed cmp_nodes() to not do a force_number() -- this, combined with a + force_number() on ARGV[] and ENVIRON[] brings it into line with other awks +Greatly simplified cmp_nodes(). +Separated out Node_NF, Node_FS, Node_RS, Node_NR and Node_FNR in get_lhs(). +All adjacent string concatenations now done at once. + +field.c + +Added support for FIELDWIDTHS. +Fixed bug in get_field() whereby changes to a field were not always + properly reflected in $0. +Reordered tests in parse_field() so that reference off the end of the buffer + doesn't happen. +set_FS() now sets *parse_field i.e. routine to call depending on type of FS. +It also does make_regexp() for FS if needed. get_field() passes FS_regexp + to re_parse_field(), as does do_split(). +Changes to set_field() and set_record() to avoid malloc'ing and free'ing the + field nodes repeatedly. The fields now just point into $0 unless they are + assigned to another variable or changed.
force_number() on the field is + *only* done when the field is needed. + +gawk.1 + +Fixed troff formatting problem on .TP lines. + +io.c + +Moved some code out into iop.c. +Output from pipes and system() calls is properly synchronized. +Status from pipe close properly returned. +Bug in getline with no redirect fixed. + +iop.c + +This file contains a totally revamped get_a_record and associated code. + +main.c + +Command line programs no longer use a temporary file. +Therefore, tmpnam() no longer required. +Deprecated -a and -e options -- they will go away in the next release, + but for now they cause a warning. +Moved -C, -V, -c options to -W ala posix. +Added -W posix option: throw out \x +Added -W lint option. + + +node.c + +force_number() now allows pure numerics to have leading whitespace. +Added make_string facility to optimize case of adding an already malloc'd + string. +Cleaned up and simplified do_deref(). +Fixed bug in handling of stref==255 in do_deref(). + +re.c + +contains the interface to regexp code + +Changes from 2.11.1 to FSF version of same +------------------------------------------ +Thu Jan 4 14:19:30 1990 Jim Kingdon (kingdon at albert) + + * Makefile (YACC): Add -y to bison part. + + * missing.c: Add #include <stdio.h>. + +Sun Dec 24 16:16:05 1989 David J. MacKenzie (djm at hobbes.ai.mit.edu) + + * * Makefile: Add (commented out) default defines for Sony News. + + * awk.h: Move declaration of vprintf so it will compile when + -DVPRINTF_MISSING is defined. + +Mon Nov 13 18:54:08 1989 Robert J. Chassell (bob at apple-gunkies.ai.mit.edu) + + * gawk.texinfo: changed @-commands that are not part of the + standard, currently released texinfmt.el to those that are. + Otherwise, only people with the as-yet unreleased makeinfo.c can + format this file. + +Changes from 2.11beta to 2.11.1 (production) +-------------------------------------------- + +Went from "beta" to production status!!! 
+ +Now flushes stdout before closing pipes or redirected files to +synchronize output. + +MS-DOS changes added in. + +Signal handler return type parameterized in Makefile and awk.h and +some lint removed. debug.c cleaned up. + +Fixed FS splitting to never match null strings, per book. + +Correction to the manual's description of FS. + +Some compilers break on char *foo = "string" + 4 so fixed version.sh and +main.c. + +Changes from 2.10beta to 2.11beta +--------------------------------- + +This release fixes all reported bugs that we could reproduce. Probably +some of the changes are not documented here. + +The next release will probably not be a beta release! + +The most important change is the addition of the -nostalgia option. :-) + +The documentation has been improved and brought up-to-date. + +There has been a lot of general cleaning up of the code that is not otherwise +documented here. There has been a movement toward using standard-conforming +library routines and providing them (in missing.d) for systems lacking them. +Improved (hopefully) configuration through Makefile modifications and missing.c. +In particular, straightened out confusion over vprintf #defines, declarations +etc. + +Deleted RCS log comments from source, to reduce source size by about one third. +Most of them were horribly out-of-date, anyway. + +Renamed source files to reflect (for the most part) their contents. + +More and improved error messages. Cleanup and fixes to yyerror(). +String constants are not altered in input buffer, so error messages come out +better. Fixed usage message. Make use of ANSI C strerror() function +(provided). + +Plugged many more memory leaks. The memory consumption is now quite +reasonable over a wide range of programs. + +Uses volatile declaration if STDC > 0 to avoid problems due to longjmp. + +New -a and -e options to use awk or egrep style regexps, respectively, +since POSIX says awk should use egrep regexps. Default is -a.
+ +Added -v option for setting variables before the first file is encountered. +Version information now uses -V and copyleft uses -C. + +Added a patchlevel.h file and its use for -V and -C. + +Append_right() optimized for major improvement to programs with a *lot* +of statements. + +Operator precedence has been corrected to match draft Posix. + +Tightened up grammar for builtin functions so that only length +may be called without arguments or parentheses. + +/regex/ is now a normal expression that can appear in any expression +context. + +Allow /= to begin a regexp. Allow ..[../..].. in a regexp. + +Allow empty compound statements ({}). + +Made return and next illegal outside a function and in BEGIN/END respectively. + +Division by zero is now illegal and causes a fatal error. + +Fixed exponentiation so that x ^ 0 and x ^= 0 both return 1. + +Fixed do_sqrt, do_log, and do_exp to do argument/return checking and +print an error message, per the manual. + +Fixed main to catch SIGSEGV to get source and data file line numbers. + +Fixed yyerror to print the ^ at the beginning of the bad token, not the end. + +Fix to substr() builtin: it was failing if the arguments +weren't already strings. + +Added new node value flag NUMERIC to indicate that a variable is +purely a number as opposed to type NUM which indicates that +the node's numeric value is valid. This is set in make_number(), +tmp_number and r_force_number() when appropriate and used in +cmp_nodes(). This fixed a bug in comparison of variables that had +numeric prefixes. The new code uses strtod() and eliminates is_a_number(). +A simple strtod() is provided for systems lacking one. It does no +overflow checking, so could be improved. + +Simplification and efficiency improvement in force_string. + +Added performance tweak in r_force_number(). + +Fixed a bug with nested loops and break/continue in functions. + +Fixed inconsistency in handling of empty fields when $0 has to be rebuilt. 
+Happens to simplify rebuild_record(). + +Cleaned up the code associated with opening a pipe for reading. Gawk +now has its own popen routine (gawk_popen) that allocates an IOBUF +and keeps track of the pid of the child process. gawk_pclose +marks the appropriate child as defunct in the right struct redirect. + +Cleaned up and fixed close_redir(). + +Fixed an obscure bug to do with redirection. Intermingled ">" and ">>" +redirects did not output in a predictable order. + +Improved handling of output buffering: now all print[f]s redirected to a tty +or pipe are flushed immediately and non-redirected output to a tty is flushed +before the next input record is read. + +Fixed a bug in get_a_record() where bcopy() could have copied over +a random pointer. + +Fixed a bug when RS="" and records separated by multiple blank lines. + +Got rid of SLOWIO code which was out-of-date anyway. + +Fix in get_field() for case where $0 is changed and then $(n) are +changed and then $0 is used. + +Fixed infinite loop on failure to open file for reading from getline. +Now handles redirect file open failures properly. + +Filenames such as /dev/stdin now allowed on the command line as well as +in redirects. + +Fixed so that gawk '$1' where $1 is a zero tests false. + +Fixed parsing so that `RLENGTH -1' parses the same as `RLENGTH - 1', +for example. + +The return from a user-defined function now defaults to the Null node. +This fixes a core-dump-causing bug when the return value of a function +is used and that function returns no value. + +Now catches floating point exceptions to avoid core dumps. + +Bug fix for deleting elements of an array -- under some conditions, it was +deleting more than one element at a time. + +Fix in AWKPATH code for running off the end of the string. + +Fixed handling of precision in *printf calls. %0.2d now works properly, +as does %c. [s]printf now recognizes %i and %X. + +Fixed a bug in printing of very large (>240) strings.
+ +Cleaned up erroneous behaviour for RS == "". + +Added IGNORECASE support to index(). + +Simplified and fixed newnode/freenode. + +Fixed reference to $(anything) in a BEGIN block. + +Eliminated use of USG rand48(). + +Bug fix in force_string for machines with 16-bit ints. + +Replaced use of mktemp() with tmpnam() and provided a partial implementation of +the latter for systems that don't have it. + +Added a portability check for includes in io.c. + +Minor portability fix in alloc.c plus addition of xmalloc(). + +Portability fix: on UMAX4.2, st_blksize is zero for a pipe, thus breaking +iop_alloc() -- fixed. + +Workaround for compiler bug on Sun386i in do_sprintf. + +More and improved prototypes in awk.h. + +Consolidated C escape parsing code into one place. + +strict flag is now turned on only when invoked with compatibility option. +It now applies to fewer things. + +Changed cast of f._ptr in vprintf.c from (unsigned char *) to (char *). +Hopefully this is right for the systems that use this code (I don't). + +Support for pipes under MSDOS added. diff --git a/gnu/usr.bin/awk/PORTS b/gnu/usr.bin/awk/PORTS new file mode 100644 index 000000000000..95e133f9dd03 --- /dev/null +++ b/gnu/usr.bin/awk/PORTS @@ -0,0 +1,32 @@ +A recent version of gawk has been successfully compiled and run "make test" +on the following: + +Sun 4/490 running 4.1 +NeXT running 2.0 +DECstation 3100 running Ultrix 4.0 or Ultrix 3.1 (different config) +AtariST (16-bit ints, gcc compiler, byacc, running under TOS) +ESIX V.3.2 Rev D (== System V Release 3.2), the 386.
compiler was gcc + bison +IBM RS/6000 (see README.rs6000) +486 running SVR4, using cc and bison +SGI running IRIX 3.3 using gcc (fails with cc) +Sequent Balance running Dynix V3.1 +Cray Y-MP8 running Unicos 6.0.11 +Cray 2 running Unicos 6.1 (modulo trailing zeroes in chem) +VAX/VMS V5.x (should also work on 4.6 and 4.7) +VMS POSIX V1.0, V1.1 +OpenVMS AXP V1.0 +MSDOS - Microsoft C 5.1, compiles and runs very simple testing +BSD 4.4alpha + +From: ghazi@caip.rutgers.edu (Kaveh R. Ghazi): + +arch configured as: +---- -------------- +Hpux 9.0 hpux8x +NeXTStep 2.0 next20 +Sgi Irix 4.0.5 (/bin/cc) sgi405.cc (new file) +Stardent Titan 1500 OSv2.5 sysv3 +Stardent Vistra (i860) SVR4 sysv4 +SunOS 4.1.2 sunos41 +Tektronix XD88 (UTekV 3.2e) sysv3 +Ultrix 4.2 ultrix41 diff --git a/gnu/usr.bin/awk/POSIX b/gnu/usr.bin/awk/POSIX new file mode 100644 index 000000000000..f2405420aedf --- /dev/null +++ b/gnu/usr.bin/awk/POSIX @@ -0,0 +1,95 @@ +Right now, the numeric vs. string comparisons are screwed up in draft +11.2. What prompted me to check it out was the note in gnu.bug.utils +which observed that gawk was doing the comparison $1 == "000" +numerically. I think that we can agree that intuitively, this should +be done as a string comparison. Version 2.13.2 of gawk follows the +current POSIX draft. Following is how I (now) think this +stuff should be done. + +1. A numeric literal or the result of a numeric operation has the NUMERIC + attribute. + +2. A string literal or the result of a string operation has the STRING + attribute. + +3. Fields, getline input, FILENAME, ARGV elements, ENVIRON elements and the + elements of an array created by split() that are numeric strings + have the STRNUM attribute. Otherwise, they have the STRING attribute. + Uninitialized variables also have the STRNUM attribute. + +4. Attributes propagate across assignments, but are not changed by + any use. 
(Although a use may cause the entity to acquire an additional + value such that it has both a numeric and string value -- this leaves the + attribute unchanged.) + +When two operands are compared, either string comparison or numeric comparison +may be used, depending on the attributes of the operands, according to the +following (symmetric) matrix: + + +---------------------------------------------- + | STRING NUMERIC STRNUM +--------+---------------------------------------------- + | +STRING | string string string + | +NUMERIC | string numeric numeric + | +STRNUM | string numeric numeric +--------+---------------------------------------------- + +So, the following program should print all OKs. + +echo '0e2 0a 0 0b +0e2 0a 0 0b' | +$AWK ' +NR == 1 { + num = 0 + str = "0e2" + + print ++test ": " ( (str == "0e2") ? "OK" : "OOPS" ) + print ++test ": " ( ("0e2" != 0) ? "OK" : "OOPS" ) + print ++test ": " ( ("0" != $2) ? "OK" : "OOPS" ) + print ++test ": " ( ("0e2" == $1) ? "OK" : "OOPS" ) + + print ++test ": " ( (0 == "0") ? "OK" : "OOPS" ) + print ++test ": " ( (0 == num) ? "OK" : "OOPS" ) + print ++test ": " ( (0 != $2) ? "OK" : "OOPS" ) + print ++test ": " ( (0 == $1) ? "OK" : "OOPS" ) + + print ++test ": " ( ($1 != "0") ? "OK" : "OOPS" ) + print ++test ": " ( ($1 == num) ? "OK" : "OOPS" ) + print ++test ": " ( ($2 != 0) ? "OK" : "OOPS" ) + print ++test ": " ( ($2 != $1) ? "OK" : "OOPS" ) + print ++test ": " ( ($3 == 0) ? "OK" : "OOPS" ) + print ++test ": " ( ($3 == $1) ? "OK" : "OOPS" ) + print ++test ": " ( ($2 != $4) ? "OK" : "OOPS" ) # 15 +} +{ + a = "+2" + b = 2 + if (NR % 2) + c = a + b + print ++test ": " ( (a != b) ? "OK" : "OOPS" ) # 16 and 22 + + d = "2a" + b = 2 + if (NR % 2) + c = d + b + print ++test ": " ( (d != b) ? "OK" : "OOPS" ) + + print ++test ": " ( (d + 0 == b) ? "OK" : "OOPS" ) + + e = "2" + print ++test ": " ( (e == b "") ? "OK" : "OOPS" ) + + a = "2.13" + print ++test ": " ( (a == 2.13) ? 
"OK" : "OOPS" ) + + a = "2.130000" + print ++test ": " ( (a != 2.13) ? "OK" : "OOPS" ) + + if (NR == 2) { + CONVFMT = "%.6f" + print ++test ": " ( (a == 2.13) ? "OK" : "OOPS" ) + } +}' diff --git a/gnu/usr.bin/awk/PROBLEMS b/gnu/usr.bin/awk/PROBLEMS new file mode 100644 index 000000000000..3b7c5148bd8e --- /dev/null +++ b/gnu/usr.bin/awk/PROBLEMS @@ -0,0 +1,6 @@ +This is a list of known problems in gawk 2.15. +Hopefully they will all be fixed in the next major release of gawk. + +Please keep in mind that the code is still undergoing significant evolution. + +1. Gawk's printf is probably still not POSIX compliant. diff --git a/gnu/usr.bin/awk/README b/gnu/usr.bin/awk/README new file mode 100644 index 000000000000..f4bd3df806c8 --- /dev/null +++ b/gnu/usr.bin/awk/README @@ -0,0 +1,116 @@ +README: + +This is GNU Awk 2.15. It should be upwardly compatible with the +System V Release 4 awk. It is almost completely compliant with draft 11.3 +of POSIX 1003.2. + +This release adds new features -- see NEWS for details. + +See the installation instructions, below. + +Known problems are given in the PROBLEMS file. Work to be done is +described briefly in the FUTURES file. Verified ports are listed in +the PORTS file. Changes in this version are summarized in the CHANGES file. +Please read the LIMITATIONS and ACKNOWLEDGMENT files. + +Read the file POSIX for a discussion of how the standard says comparisons +should be done vs. how they really should be done and how gawk does them. + +To format the documentation with TeX, you must use texinfo.tex 2.53 +or later. Otherwise footnotes look unacceptable. + +If you wish to remake the Info files, you should use makeinfo. The 2.15 +version of makeinfo works with no errors. + +The man page is up to date. + +INSTALLATION: + +Check whether there is a system-specific README file for your system. + +Makefile.in may need some tailoring. The only changes necessary should +be to change installation targets or to change compiler flags. 
+The changes to make in Makefile.in are commented and should be obvious. + +All other changes should be made in a config file. Samples for +various systems are included in the config directory. Starting with +2.11, our intent has been to make the code conform to standards (ANSI, +POSIX, SVID, in that order) whenever possible, and to not penalize +standard conforming systems. We have included substitute versions of +routines not universally available. Simply add the appropriate define +for the missing feature(s) on your system. + +If you have neither bison nor yacc, use the awktab.c file here. It was +generated with bison, and should have no AT&T code in it. (Note that +modifying awk.y without bison or yacc will be difficult, at best. You might +want to get a copy of bison from the FSF too.) + +If no config file is included for your system, start by copying one +for a similar system. One way of determining the defines needed is to +try to load gawk with nothing defined and see what routines are +unresolved by the loader. This should give you a good idea of how to +proceed. + +The next release will use the FSF autoconfig program, so we are no longer +soliciting new config files. + +If you have an MS-DOS system, use the stuff in the pc directory. +For an Atari there is an atari directory and similarly one for VMS. + +Chapter 16 of The GAWK Manual discusses configuration in detail. + +After successful compilation, do 'make test' to run a small test +suite. There should be no output from the 'cmp' invocations except in +the cases where there are small differences in floating point values. +If there are other differences, please investigate and report the +problem. + +PRINTING THE MANUAL + +The 'support' directory contains texinfo.tex 2.65, which will be necessary +for printing the manual, and the texindex.c program from the texinfo +distribution which is also necessary. See the makefile for the steps needed +to get a DVI file from the manual. 
+ +CAVEATS + +The existence of a patchlevel.h file does *N*O*T* imply a commitment on +our part to issue bug fixes or patches. It is there in case we should +decide to do so. + +BUG REPORTS AND FIXES (Un*x systems): + +Please coordinate changes through David Trueman and/or Arnold Robbins. + +David Trueman +Department of Mathematics, Statistics and Computing Science, +Dalhousie University, Halifax, Nova Scotia, Canada + +UUCP: {uunet utai watmath}!dalcs!david +INTERNET: david@cs.dal.ca + +Arnold Robbins +1736 Reindeer Drive +Atlanta, GA, 30329, USA + +INTERNET: arnold@skeeve.atl.ga.us +UUCP: { gatech, emory, emoryu1 }!skeeve!arnold + +BUG REPORTS AND FIXES (non-Unix ports): + +MS-DOS: + Scott Deifik + AMGEN Inc. + Amgen Center, Bldg.17-Dept.393 + Thousand Oaks, CA 91320-1789 + Tel-805-499-5725 ext.4677 + Fax-805-498-0358 + scottd@amgen.com + +VMS: + Pat Rankin + rankin@eql.caltech.edu (e-mail only) + +Atari ST: + Michal Jaegermann + NTOMCZAK@vm.ucs.UAlberta.CA (e-mail only) diff --git a/gnu/usr.bin/awk/array.c b/gnu/usr.bin/awk/array.c new file mode 100644 index 000000000000..59be340c04df --- /dev/null +++ b/gnu/usr.bin/awk/array.c @@ -0,0 +1,293 @@ +/* + * array.c - routines for associative arrays. + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. 
+ * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +static NODE *assoc_find P((NODE *symbol, NODE *subs, int hash1)); + +NODE * +concat_exp(tree) +register NODE *tree; +{ + register NODE *r; + char *str; + char *s; + unsigned len; + int offset; + int subseplen; + char *subsep; + + if (tree->type != Node_expression_list) + return force_string(tree_eval(tree)); + r = force_string(tree_eval(tree->lnode)); + if (tree->rnode == NULL) + return r; + subseplen = SUBSEP_node->lnode->stlen; + subsep = SUBSEP_node->lnode->stptr; + len = r->stlen + subseplen + 2; + emalloc(str, char *, len, "concat_exp"); + memcpy(str, r->stptr, r->stlen+1); + s = str + r->stlen; + free_temp(r); + tree = tree->rnode; + while (tree) { + if (subseplen == 1) + *s++ = *subsep; + else { + memcpy(s, subsep, subseplen+1); + s += subseplen; + } + r = force_string(tree_eval(tree->lnode)); + len += r->stlen + subseplen; + offset = s - str; + erealloc(str, char *, len, "concat_exp"); + s = str + offset; + memcpy(s, r->stptr, r->stlen+1); + s += r->stlen; + free_temp(r); + tree = tree->rnode; + } + r = make_str_node(str, s - str, ALREADY_MALLOCED); + r->flags |= TEMP; + return r; +} + +/* Flush all the values in symbol[] before doing a split() */ +void +assoc_clear(symbol) +NODE *symbol; +{ + int i; + NODE *bucket, *next; + + if (symbol->var_array == 0) + return; + for (i = 0; i < HASHSIZE; i++) { + for (bucket = symbol->var_array[i]; bucket; bucket = next) { + next = bucket->ahnext; + unref(bucket->ahname); + unref(bucket->ahvalue); + freenode(bucket); + } + symbol->var_array[i] = 0; + } +} + +/* + * calculate the hash function of the string in subs + */ +unsigned int +hash(s, len) +register char *s; +register int len; +{ + register unsigned long h = 0, g; + + while (len--) { + h = (h << 4) + *s++; + g = (h & 
0xf0000000); + if (g) { + h = h ^ (g >> 24); + h = h ^ g; + } + } + if (h < HASHSIZE) + return h; + else + return h%HASHSIZE; +} + +/* + * locate symbol[subs] + */ +static NODE * /* NULL if not found */ +assoc_find(symbol, subs, hash1) +NODE *symbol; +register NODE *subs; +int hash1; +{ + register NODE *bucket, *prev = 0; + + for (bucket = symbol->var_array[hash1]; bucket; bucket = bucket->ahnext) { + if (cmp_nodes(bucket->ahname, subs) == 0) { + if (prev) { /* move found to front of chain */ + prev->ahnext = bucket->ahnext; + bucket->ahnext = symbol->var_array[hash1]; + symbol->var_array[hash1] = bucket; + } + return bucket; + } else + prev = bucket; /* save previous list entry */ + } + return NULL; +} + +/* + * test whether the array element symbol[subs] exists or not + */ +int +in_array(symbol, subs) +NODE *symbol, *subs; +{ + register int hash1; + + if (symbol->type == Node_param_list) + symbol = stack_ptr[symbol->param_cnt]; + if (symbol->var_array == 0) + return 0; + subs = concat_exp(subs); /* concat_exp returns a string node */ + hash1 = hash(subs->stptr, subs->stlen); + if (assoc_find(symbol, subs, hash1) == NULL) { + free_temp(subs); + return 0; + } else { + free_temp(subs); + return 1; + } +} + +/* + * SYMBOL is the address of the node (or other pointer) being dereferenced. + * SUBS is a number or string used as the subscript. + * + * Find SYMBOL[SUBS] in the assoc array. Install it with value "" if it + * isn't there. 
Returns a pointer ala get_lhs to where its value is stored + */ +NODE ** +assoc_lookup(symbol, subs) +NODE *symbol, *subs; +{ + register int hash1; + register NODE *bucket; + + (void) force_string(subs); + hash1 = hash(subs->stptr, subs->stlen); + + if (symbol->var_array == 0) { /* this table really should grow + * dynamically */ + unsigned size; + + size = sizeof(NODE *) * HASHSIZE; + emalloc(symbol->var_array, NODE **, size, "assoc_lookup"); + memset((char *)symbol->var_array, 0, size); + symbol->type = Node_var_array; + } else { + bucket = assoc_find(symbol, subs, hash1); + if (bucket != NULL) { + free_temp(subs); + return &(bucket->ahvalue); + } + } + + /* It's not there, install it. */ + if (do_lint && subs->stlen == 0) + warning("subscript of array `%s' is null string", + symbol->vname); + getnode(bucket); + bucket->type = Node_ahash; + if (subs->flags & TEMP) + bucket->ahname = dupnode(subs); + else { + unsigned int saveflags = subs->flags; + + subs->flags &= ~MALLOC; + bucket->ahname = dupnode(subs); + subs->flags = saveflags; + } + free_temp(subs); + + /* array subscripts are strings */ + bucket->ahname->flags &= ~NUMBER; + bucket->ahname->flags |= STRING; + bucket->ahvalue = Nnull_string; + bucket->ahnext = symbol->var_array[hash1]; + symbol->var_array[hash1] = bucket; + return &(bucket->ahvalue); +} + +void +do_delete(symbol, tree) +NODE *symbol, *tree; +{ + register int hash1; + register NODE *bucket, *last; + NODE *subs; + + if (symbol->type == Node_param_list) + symbol = stack_ptr[symbol->param_cnt]; + if (symbol->var_array == 0) + return; + subs = concat_exp(tree); /* concat_exp returns string node */ + hash1 = hash(subs->stptr, subs->stlen); + + last = NULL; + for (bucket = symbol->var_array[hash1]; bucket; last = bucket, bucket = bucket->ahnext) + if (cmp_nodes(bucket->ahname, subs) == 0) + break; + free_temp(subs); + if (bucket == NULL) + return; + if (last) + last->ahnext = bucket->ahnext; + else + symbol->var_array[hash1] = bucket->ahnext; + 
unref(bucket->ahname); + unref(bucket->ahvalue); + freenode(bucket); +} + +void +assoc_scan(symbol, lookat) +NODE *symbol; +struct search *lookat; +{ + if (!symbol->var_array) { + lookat->retval = NULL; + return; + } + lookat->arr_ptr = symbol->var_array; + lookat->arr_end = lookat->arr_ptr + HASHSIZE; /* added */ + lookat->bucket = symbol->var_array[0]; + assoc_next(lookat); +} + +void +assoc_next(lookat) +struct search *lookat; +{ + while (lookat->arr_ptr < lookat->arr_end) { + if (lookat->bucket != 0) { + lookat->retval = lookat->bucket->ahname; + lookat->bucket = lookat->bucket->ahnext; + return; + } + lookat->arr_ptr++; + if (lookat->arr_ptr < lookat->arr_end) + lookat->bucket = *(lookat->arr_ptr); + else + lookat->retval = NULL; + } + return; +} diff --git a/gnu/usr.bin/awk/awk.1 b/gnu/usr.bin/awk/awk.1 new file mode 100644 index 000000000000..0338485e8db8 --- /dev/null +++ b/gnu/usr.bin/awk/awk.1 @@ -0,0 +1,1873 @@ +.ds PX \s-1POSIX\s+1 +.ds UX \s-1UNIX\s+1 +.ds AN \s-1ANSI\s+1 +.TH GAWK 1 "Apr 15 1993" "Free Software Foundation" "Utility Commands" +.SH NAME +gawk \- pattern scanning and processing language +.SH SYNOPSIS +.B gawk +[ POSIX or GNU style options ] +.B \-f +.I program-file +[ +.B \-\^\- +] file .\^.\^. +.br +.B gawk +[ POSIX or GNU style options ] +[ +.B \-\^\- +] +.I program-text +file .\^.\^. +.SH DESCRIPTION +.I Gawk +is the GNU Project's implementation of the AWK programming language. +It conforms to the definition of the language in +the \*(PX 1003.2 Command Language And Utilities Standard. +This version in turn is based on the description in +.IR "The AWK Programming Language" , +by Aho, Kernighan, and Weinberger, +with the additional features defined in the System V Release 4 version +of \*(UX +.IR awk . +.I Gawk +also provides some GNU-specific extensions. 
+.PP +The command line consists of options to +.I gawk +itself, the AWK program text (if not supplied via the +.B \-f +or +.B \-\^\-file +options), and values to be made +available in the +.B ARGC +and +.B ARGV +pre-defined AWK variables. +.SH OPTIONS +.PP +.I Gawk +options may be either the traditional \*(PX one letter options, +or the GNU style long options. \*(PX style options start with a single ``\-'', +while GNU long options start with ``\-\^\-''. +GNU style long options are provided for both GNU-specific features and +for \*(PX mandated features. Other implementations of the AWK language +are likely to only accept the traditional one letter options. +.PP +Following the \*(PX standard, +.IR gawk -specific +options are supplied via arguments to the +.B \-W +option. Multiple +.B \-W +options may be supplied, or multiple arguments may be supplied together +if they are separated by commas, or enclosed in quotes and separated +by white space. +Case is ignored in arguments to the +.B \-W +option. +Each +.B \-W +option has a corresponding GNU style long option, as detailed below. +.PP +.I Gawk +accepts the following options. +.TP +.PD 0 +.BI \-F " fs" +.TP +.PD +.BI \-\^\-field-separator= fs +Use +.I fs +for the input field separator (the value of the +.B FS +predefined +variable). +.TP +.PD 0 +\fB\-v\fI var\fB\^=\^\fIval\fR +.TP +.PD +\fB\-\^\-assign=\fIvar\fB\^=\^\fIval\fR +Assign the value +.IR val , +to the variable +.IR var , +before execution of the program begins. +Such variable values are available to the +.B BEGIN +block of an AWK program. +.TP +.PD 0 +.BI \-f " program-file" +.TP +.PD +.BI \-\^\-file= program-file +Read the AWK program source from the file +.IR program-file , +instead of from the first command line argument. +Multiple +.B \-f +(or +.BR \-\^\-file ) +options may be used. +.TP \w'\fB\-\^\-copyright\fR'u+1n +.PD 0 +.B "\-W compat" +.TP +.PD +.B \-\^\-compat +Run in +.I compatibility +mode. 
In compatibility mode, +.I gawk +behaves identically to \*(UX +.IR awk ; +none of the GNU-specific extensions are recognized. +See +.BR "GNU EXTENSIONS" , +below, for more information. +.TP +.PD 0 +.B "\-W copyleft" +.TP +.PD 0 +.B "\-W copyright" +.TP +.PD 0 +.B \-\^\-copyleft +.TP +.PD +.B \-\^\-copyright +Print the short version of the GNU copyright information message on +the error output. +.TP +.PD 0 +.B "\-W help" +.TP +.PD 0 +.B "\-W usage" +.TP +.PD 0 +.B \-\^\-help +.TP +.PD +.B \-\^\-usage +Print a relatively short summary of the available options on +the error output. +.TP +.PD 0 +.B "\-W lint" +.TP +.PD 0 +.B \-\^\-lint +Provide warnings about constructs that are +dubious or non-portable to other AWK implementations. +.ig +.\" This option is left undocumented, on purpose. +.TP +.PD 0 +.B "\-W nostalgia" +.TP +.PD +.B \-\^\-nostalgia +Provide a moment of nostalgia for long time +.I awk +users. +.. +.TP +.PD 0 +.B "\-W posix" +.TP +.PD +.B \-\^\-posix +This turns on +.I compatibility +mode, with the following additional restrictions: +.RS +.TP \w'\(bu'u+1n +\(bu +.B \ex +escape sequences are not recognized. +.TP +\(bu +The synonym +.B func +for the keyword +.B function +is not recognized. +.TP +\(bu +The operators +.B ** +and +.B **= +cannot be used in place of +.B ^ +and +.BR ^= . +.RE +.TP +.PD 0 +.BI "\-W source=" program-text +.TP +.PD +.BI \-\^\-source= program-text +Use +.I program-text +as AWK program source code. +This option allows the easy intermixing of library functions (used via the +.B \-f +and +.B \-\^\-file +options) with source code entered on the command line. +It is intended primarily for medium to large size AWK programs used +in shell scripts. +.sp .5 +The +.B "\-W source=" +form of this option uses the rest of the command line argument for +.IR program-text ; +no other options to +.B \-W +will be recognized in the same argument. 
+.TP +.PD 0 +.B "\-W version" +.TP +.PD +.B \-\^\-version +Print version information for this particular copy of +.I gawk +on the error output. +This is useful mainly for knowing if the current copy of +.I gawk +on your system +is up to date with respect to whatever the Free Software Foundation +is distributing. +.TP +.B \-\^\- +Signal the end of options. This is useful to allow further arguments to the +AWK program itself to start with a ``\-''. +This is mainly for consistency with the argument parsing convention used +by most other \*(PX programs. +.PP +Any other options are flagged as illegal, but are otherwise ignored. +.SH AWK PROGRAM EXECUTION +.PP +An AWK program consists of a sequence of pattern-action statements +and optional function definitions. +.RS +.PP +\fIpattern\fB { \fIaction statements\fB }\fR +.br +\fBfunction \fIname\fB(\fIparameter list\fB) { \fIstatements\fB }\fR +.RE +.PP +.I Gawk +first reads the program source from the +.IR program-file (s) +if specified, or from the first non-option argument on the command line. +The +.B \-f +option may be used multiple times on the command line. +.I Gawk +will read the program text as if all the +.IR program-file s +had been concatenated together. This is useful for building libraries +of AWK functions, without having to include them in each new AWK +program that uses them. To use a library function in a file from a +program typed in on the command line, specify +.B /dev/tty +as one of the +.IR program-file s, +type your program, and end it with a +.B ^D +(control-d). +.PP +The environment variable +.B AWKPATH +specifies a search path to use when finding source files named with +the +.B \-f +option. If this variable does not exist, the default path is +\fB".:/usr/lib/awk:/usr/local/lib/awk"\fR. +If a file name given to the +.B \-f +option contains a ``/'' character, no path search is performed. +.PP +.I Gawk +executes AWK programs in the following order. 
+First, +.I gawk +compiles the program into an internal form. +Next, all variable assignments specified via the +.B \-v +option are performed. Then, +.I gawk +executes the code in the +.B BEGIN +block(s) (if any), +and then proceeds to read +each file named in the +.B ARGV +array. +If there are no files named on the command line, +.I gawk +reads the standard input. +.PP +If a filename on the command line has the form +.IB var = val +it is treated as a variable assignment. The variable +.I var +will be assigned the value +.IR val . +(This happens after any +.B BEGIN +block(s) have been run.) +Command line variable assignment +is most useful for dynamically assigning values to the variables +AWK uses to control how input is broken into fields and records. It +is also useful for controlling state if multiple passes are needed over +a single data file. +.PP +If the value of a particular element of +.B ARGV +is empty (\fB""\fR), +.I gawk +skips over it. +.PP +For each line in the input, +.I gawk +tests to see if it matches any +.I pattern +in the AWK program. +For each pattern that the line matches, the associated +.I action +is executed. +The patterns are tested in the order they occur in the program. +.PP +Finally, after all the input is exhausted, +.I gawk +executes the code in the +.B END +block(s) (if any). +.SH VARIABLES AND FIELDS +AWK variables are dynamic; they come into existence when they are +first used. Their values are either floating-point numbers or strings, +or both, +depending upon how they are used. AWK also has one dimension +arrays; multiply dimensioned arrays may be simulated. +Several pre-defined variables are set as a program +runs; these will be described as needed and summarized below. +.SS Fields +.PP +As each input line is read, +.I gawk +splits the line into +.IR fields , +using the value of the +.B FS +variable as the field separator. +If +.B FS +is a single character, fields are separated by that character. 
+Otherwise, +.B FS +is expected to be a full regular expression. +In the special case that +.B FS +is a single blank, fields are separated +by runs of blanks and/or tabs. +Note that the value of +.B IGNORECASE +(see below) will also affect how fields are split when +.B FS +is a regular expression. +.PP +If the +.B FIELDWIDTHS +variable is set to a space separated list of numbers, each field is +expected to have fixed width, and +.I gawk +will split up the record using the specified widths. The value of +.B FS +is ignored. +Assigning a new value to +.B FS +overrides the use of +.BR FIELDWIDTHS , +and restores the default behavior. +.PP +Each field in the input line may be referenced by its position, +.BR $1 , +.BR $2 , +and so on. +.B $0 +is the whole line. The value of a field may be assigned to as well. +Fields need not be referenced by constants: +.RS +.PP +.ft B +n = 5 +.br +print $n +.ft R +.RE +.PP +prints the fifth field in the input line. +The variable +.B NF +is set to the total number of fields in the input line. +.PP +References to non-existent fields (i.e. fields after +.BR $NF ) +produce the null-string. However, assigning to a non-existent field +(e.g., +.BR "$(NF+2) = 5" ) +will increase the value of +.BR NF , +create any intervening fields with the null string as their value, and +cause the value of +.B $0 +to be recomputed, with the fields being separated by the value of +.BR OFS . +.SS Built-in Variables +.PP +AWK's built-in variables are: +.PP +.TP \w'\fBFIELDWIDTHS\fR'u+1n +.B ARGC +The number of command line arguments (does not include options to +.IR gawk , +or the program source). +.TP +.B ARGIND +The index in +.B ARGV +of the current file being processed. +.TP +.B ARGV +Array of command line arguments. The array is indexed from +0 to +.B ARGC +\- 1. +Dynamically changing the contents of +.B ARGV +can control the files used for data. +.TP +.B CONVFMT +The conversion format for numbers, \fB"%.6g"\fR, by default. 
+.TP +.B ENVIRON +An array containing the values of the current environment. +The array is indexed by the environment variables, each element being +the value of that variable (e.g., \fBENVIRON["HOME"]\fP might be +.BR /u/arnold ). +Changing this array does not affect the environment seen by programs which +.I gawk +spawns via redirection or the +.B system() +function. +(This may change in a future version of +.IR gawk .) +.\" but don't hold your breath... +.TP +.B ERRNO +If a system error occurs either doing a redirection for +.BR getline , +during a read for +.BR getline , +or during a +.BR close , +then +.B ERRNO +will contain +a string describing the error. +.TP +.B FIELDWIDTHS +A white-space separated list of fieldwidths. When set, +.I gawk +parses the input into fields of fixed width, instead of using the +value of the +.B FS +variable as the field separator. +The fixed field width facility is still experimental; expect the +semantics to change as +.I gawk +evolves over time. +.TP +.B FILENAME +The name of the current input file. +If no files are specified on the command line, the value of +.B FILENAME +is ``\-''. +.TP +.B FNR +The input record number in the current input file. +.TP +.B FS +The input field separator, a blank by default. +.TP +.B IGNORECASE +Controls the case-sensitivity of all regular expression operations. If +.B IGNORECASE +has a non-zero value, then pattern matching in rules, +field splitting with +.BR FS , +regular expression +matching with +.B ~ +and +.BR !~ , +and the +.BR gsub() , +.BR index() , +.BR match() , +.BR split() , +and +.B sub() +pre-defined functions will all ignore case when doing regular expression +operations. Thus, if +.B IGNORECASE +is not equal to zero, +.B /aB/ +matches all of the strings \fB"ab"\fP, \fB"aB"\fP, \fB"Ab"\fP, +and \fB"AB"\fP. +As with all AWK variables, the initial value of +.B IGNORECASE +is zero, so all regular expression operations are normally case-sensitive. 
+.TP +.B NF +The number of fields in the current input record. +.TP +.B NR +The total number of input records seen so far. +.TP +.B OFMT +The output format for numbers, \fB"%.6g"\fR, by default. +.TP +.B OFS +The output field separator, a blank by default. +.TP +.B ORS +The output record separator, by default a newline. +.TP +.B RS +The input record separator, by default a newline. +.B RS +is exceptional in that only the first character of its string +value is used for separating records. +(This will probably change in a future release of +.IR gawk .) +If +.B RS +is set to the null string, then records are separated by +blank lines. +When +.B RS +is set to the null string, then the newline character always acts as +a field separator, in addition to whatever value +.B FS +may have. +.TP +.B RSTART +The index of the first character matched by +.BR match() ; +0 if no match. +.TP +.B RLENGTH +The length of the string matched by +.BR match() ; +\-1 if no match. +.TP +.B SUBSEP +The character used to separate multiple subscripts in array +elements, by default \fB"\e034"\fR. +.SS Arrays +.PP +Arrays are subscripted with an expression between square brackets +.RB ( [ " and " ] ). +If the expression is an expression list +.RI ( expr ", " expr " ...)" +then the array subscript is a string consisting of the +concatenation of the (string) value of each expression, +separated by the value of the +.B SUBSEP +variable. +This facility is used to simulate multiply dimensioned +arrays. For example: +.PP +.RS +.ft B +i = "A" ;\^ j = "B" ;\^ k = "C" +.br +x[i, j, k] = "hello, world\en" +.ft R +.RE +.PP +assigns the string \fB"hello, world\en"\fR to the element of the array +.B x +which is indexed by the string \fB"A\e034B\e034C"\fR. All arrays in AWK +are associative, i.e. indexed by string values. +.PP +The special operator +.B in +may be used in an +.B if +or +.B while +statement to see if an array has an index consisting of a particular +value. 
+.PP +.RS +.ft B +.nf +if (val in array) + print array[val] +.fi +.ft +.RE +.PP +If the array has multiple subscripts, use +.BR "(i, j) in array" . +.PP +The +.B in +construct may also be used in a +.B for +loop to iterate over all the elements of an array. +.PP +An element may be deleted from an array using the +.B delete +statement. +.SS Variable Typing And Conversion +.PP +Variables and fields +may be (floating point) numbers, or strings, or both. How the +value of a variable is interpreted depends upon its context. If used in +a numeric expression, it will be treated as a number, if used as a string +it will be treated as a string. +.PP +To force a variable to be treated as a number, add 0 to it; to force it +to be treated as a string, concatenate it with the null string. +.PP +When a string must be converted to a number, the conversion is accomplished +using +.IR atof (3). +A number is converted to a string by using the value of +.B CONVFMT +as a format string for +.IR sprintf (3), +with the numeric value of the variable as the argument. +However, even though all numbers in AWK are floating-point, +integral values are +.I always +converted as integers. Thus, given +.PP +.RS +.ft B +.nf +CONVFMT = "%2.2f" +a = 12 +b = a "" +.fi +.ft R +.RE +.PP +the variable +.B b +has a value of \fB"12"\fR and not \fB"12.00"\fR. +.PP +.I Gawk +performs comparisons as follows: +If two variables are numeric, they are compared numerically. +If one value is numeric and the other has a string value that is a +``numeric string,'' then comparisons are also done numerically. +Otherwise, the numeric value is converted to a string and a string +comparison is performed. +Two strings are compared, of course, as strings. +According to the \*(PX standard, even if two strings are +numeric strings, a numeric comparison is performed. However, this is +clearly incorrect, and +.I gawk +does not do this. 
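The conversion and comparison rules in the man page text above can be checked with a short shell session. This is a minimal sketch, not part of the commit: it assumes a POSIX-style awk is installed as `awk`, the shell variable names are hypothetical, and the gawk 2.15 binary built from this import was not run.

```shell
#!/bin/sh
# Illustrative only: assumes a POSIX-style awk is available as `awk`.
# The shell variable names below are hypothetical, chosen for this sketch.

# Integral values are converted to strings as integers, ignoring CONVFMT.
int_conv=$(awk 'BEGIN { CONVFMT = "%2.2f"; a = 12; b = a ""; print b }')

# Non-integral values are converted to strings using CONVFMT.
frac_conv=$(awk 'BEGIN { CONVFMT = "%2.2f"; a = 12.5; b = a ""; print b }')

# Two string constants are compared as strings, even if they look numeric.
str_cmp=$(awk 'BEGIN { print (("10" < "9") ? "string" : "numeric") }')

# Fields that look like numbers are compared numerically.
fld_cmp=$(echo "10 9" | awk '{ print (($1 > $2) ? "numeric" : "string") }')

printf '%s %s %s %s\n' "$int_conv" "$frac_conv" "$str_cmp" "$fld_cmp"
```

With a POSIX-conforming awk this is expected to print `12 12.50 string numeric`. The first two values show the integer exception to CONVFMT, and the last two show that the comparison type depends on where the operands come from rather than on what they look like.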
+.PP +Uninitialized variables have the numeric value 0 and the string value "" +(the null, or empty, string). +.SH PATTERNS AND ACTIONS +AWK is a line oriented language. The pattern comes first, and then the +action. Action statements are enclosed in +.B { +and +.BR } . +Either the pattern may be missing, or the action may be missing, but, +of course, not both. If the pattern is missing, the action will be +executed for every single line of input. +A missing action is equivalent to +.RS +.PP +.B "{ print }" +.RE +.PP +which prints the entire line. +.PP +Comments begin with the ``#'' character, and continue until the +end of the line. +Blank lines may be used to separate statements. +Normally, a statement ends with a newline, however, this is not the +case for lines ending in +a ``,'', ``{'', ``?'', ``:'', ``&&'', or ``||''. +Lines ending in +.B do +or +.B else +also have their statements automatically continued on the following line. +In other cases, a line can be continued by ending it with a ``\e'', +in which case the newline will be ignored. +.PP +Multiple statements may +be put on one line by separating them with a ``;''. +This applies to both the statements within the action part of a +pattern-action pair (the usual case), +and to the pattern-action statements themselves. +.SS Patterns +AWK patterns may be one of the following: +.PP +.RS +.nf +.B BEGIN +.B END +.BI / "regular expression" / +.I "relational expression" +.IB pattern " && " pattern +.IB pattern " || " pattern +.IB pattern " ? " pattern " : " pattern +.BI ( pattern ) +.BI ! " pattern" +.IB pattern1 ", " pattern2 +.fi +.RE +.PP +.B BEGIN +and +.B END +are two special kinds of patterns which are not tested against +the input. +The action parts of all +.B BEGIN +patterns are merged as if all the statements had +been written in a single +.B BEGIN +block. They are executed before any +of the input is read. 
Similarly, all the +.B END +blocks are merged, +and executed when all the input is exhausted (or when an +.B exit +statement is executed). +.B BEGIN +and +.B END +patterns cannot be combined with other patterns in pattern expressions. +.B BEGIN +and +.B END +patterns cannot have missing action parts. +.PP +For +.BI / "regular expression" / +patterns, the associated statement is executed for each input line that matches +the regular expression. +Regular expressions are the same as those in +.IR egrep (1), +and are summarized below. +.PP +A +.I "relational expression" +may use any of the operators defined below in the section on actions. +These generally test whether certain fields match certain regular expressions. +.PP +The +.BR && , +.BR || , +and +.B ! +operators are logical AND, logical OR, and logical NOT, respectively, as in C. +They do short-circuit evaluation, also as in C, and are used for combining +more primitive pattern expressions. As in most languages, parentheses +may be used to change the order of evaluation. +.PP +The +.B ?\^: +operator is like the same operator in C. If the first pattern is true +then the pattern used for testing is the second pattern, otherwise it is +the third. Only one of the second and third patterns is evaluated. +.PP +The +.IB pattern1 ", " pattern2 +form of an expression is called a range pattern. +It matches all input records starting with a line that matches +.IR pattern1 , +and continuing until a record that matches +.IR pattern2 , +inclusive. It does not combine with any other sort of pattern expression. +.SS Regular Expressions +Regular expressions are the extended kind found in +.IR egrep . +They are composed of characters as follows: +.TP \w'\fB[^\fIabc...\fB]\fR'u+2n +.I c +matches the non-metacharacter +.IR c . +.TP +.I \ec +matches the literal character +.IR c . +.TP +.B . +matches any character except newline. +.TP +.B ^ +matches the beginning of a line or a string. 
+.TP +.B $ +matches the end of a line or a string. +.TP +.BI [ abc... ] +character class, matches any of the characters +.IR abc... . +.TP +.BI [^ abc... ] +negated character class, matches any character except +.I abc... +and newline. +.TP +.IB r1 | r2 +alternation: matches either +.I r1 +or +.IR r2 . +.TP +.I r1r2 +concatenation: matches +.IR r1 , +and then +.IR r2 . +.TP +.IB r + +matches one or more +.IR r 's. +.TP +.IB r * +matches zero or more +.IR r 's. +.TP +.IB r ? +matches zero or one +.IR r 's. +.TP +.BI ( r ) +grouping: matches +.IR r . +.PP +The escape sequences that are valid in string constants (see below) +are also legal in regular expressions. +.SS Actions +Action statements are enclosed in braces, +.B { +and +.BR } . +Action statements consist of the usual assignment, conditional, and looping +statements found in most languages. The operators, control statements, +and input/output statements +available are patterned after those in C. +.SS Operators +.PP +The operators in AWK, in order of increasing precedence, are +.PP +.TP "\w'\fB*= /= %= ^=\fR'u+1n" +.PD 0 +.B "= += \-=" +.TP +.PD +.B "*= /= %= ^=" +Assignment. Both absolute assignment +.BI ( var " = " value ) +and operator-assignment (the other forms) are supported. +.TP +.B ?: +The C conditional expression. This has the form +.IB expr1 " ? " expr2 " : " expr3\c +\&. If +.I expr1 +is true, the value of the expression is +.IR expr2 , +otherwise it is +.IR expr3 . +Only one of +.I expr2 +and +.I expr3 +is evaluated. +.TP +.B || +Logical OR. +.TP +.B && +Logical AND. +.TP +.B "~ !~" +Regular expression match, negated match. +.B NOTE: +Do not use a constant regular expression +.RB ( /foo/ ) +on the left-hand side of a +.B ~ +or +.BR !~ . +Only use one on the right-hand side. The expression +.BI "/foo/ ~ " exp +has the same meaning as \fB(($0 ~ /foo/) ~ \fIexp\fB)\fR. +This is usually +.I not +what was intended. 
+.TP +.PD 0 +.B "< >" +.TP +.PD 0 +.B "<= >=" +.TP +.PD +.B "!= ==" +The regular relational operators. +.TP +.I blank +String concatenation. +.TP +.B "+ \-" +Addition and subtraction. +.TP +.B "* / %" +Multiplication, division, and modulus. +.TP +.B "+ \- !" +Unary plus, unary minus, and logical negation. +.TP +.B ^ +Exponentiation (\fB**\fR may also be used, and \fB**=\fR for +the assignment operator). +.TP +.B "++ \-\^\-" +Increment and decrement, both prefix and postfix. +.TP +.B $ +Field reference. +.SS Control Statements +.PP +The control statements are +as follows: +.PP +.RS +.nf +\fBif (\fIcondition\fB) \fIstatement\fR [ \fBelse\fI statement \fR] +\fBwhile (\fIcondition\fB) \fIstatement \fR +\fBdo \fIstatement \fBwhile (\fIcondition\fB)\fR +\fBfor (\fIexpr1\fB; \fIexpr2\fB; \fIexpr3\fB) \fIstatement\fR +\fBfor (\fIvar \fBin\fI array\fB) \fIstatement\fR +\fBbreak\fR +\fBcontinue\fR +\fBdelete \fIarray\^\fB[\^\fIindex\^\fB]\fR +\fBexit\fR [ \fIexpression\fR ] +\fB{ \fIstatements \fB} +.fi +.RE +.SS "I/O Statements" +.PP +The input/output statements are as follows: +.PP +.TP "\w'\fBprintf \fIfmt, expr-list\fR'u+1n" +.BI close( filename ) +Close file (or pipe, see below). +.TP +.B getline +Set +.B $0 +from next input record; set +.BR NF , +.BR NR , +.BR FNR . +.TP +.BI "getline <" file +Set +.B $0 +from next record of +.IR file ; +set +.BR NF . +.TP +.BI getline " var" +Set +.I var +from next input record; set +.BR NF , +.BR FNR . +.TP +.BI getline " var" " <" file +Set +.I var +from next record of +.IR file . +.TP +.B next +Stop processing the current input record. The next input record +is read and processing starts over with the first pattern in the +AWK program. If the end of the input data is reached, the +.B END +block(s), if any, are executed. +.TP +.B "next file" +Stop processing the current input file. The next input record read +comes from the next input file. 
+.B FILENAME +is updated, +.B FNR +is reset to 1, and processing starts over with the first pattern in the +AWK program. If the end of the input data is reached, the +.B END +block(s), if any, are executed. +.TP +.B print +Prints the current record. +.TP +.BI print " expr-list" +Prints expressions. +.TP +.BI print " expr-list" " >" file +Prints expressions on +.IR file . +.TP +.BI printf " fmt, expr-list" +Format and print. +.TP +.BI printf " fmt, expr-list" " >" file +Format and print on +.IR file . +.TP +.BI system( cmd-line ) +Execute the command +.IR cmd-line , +and return the exit status. +(This may not be available on non-\*(PX systems.) +.PP +Other input/output redirections are also allowed. For +.B print +and +.BR printf , +.BI >> file +appends output to the +.IR file , +while +.BI | " command" +writes on a pipe. +In a similar fashion, +.IB command " | getline" +pipes into +.BR getline . +.BR Getline +will return 0 on end of file, and \-1 on an error. +.SS The \fIprintf\fP\^ Statement +.PP +The AWK versions of the +.B printf +statement and +.B sprintf() +function +(see below) +accept the following conversion specification formats: +.TP +.B %c +An \s-1ASCII\s+1 character. +If the argument used for +.B %c +is numeric, it is treated as a character and printed. +Otherwise, the argument is assumed to be a string, and the only first +character of that string is printed. +.TP +.B %d +A decimal number (the integer part). +.TP +.B %i +Just like +.BR %d . +.TP +.B %e +A floating point number of the form +.BR [\-]d.ddddddE[+\^\-]dd . +.TP +.B %f +A floating point number of the form +.BR [\-]ddd.dddddd . +.TP +.B %g +Use +.B e +or +.B f +conversion, whichever is shorter, with nonsignificant zeros suppressed. +.TP +.B %o +An unsigned octal number (again, an integer). +.TP +.B %s +A character string. +.TP +.B %x +An unsigned hexadecimal number (an integer). +.TP +.B %X +Like +.BR %x , +but using +.B ABCDEF +instead of +.BR abcdef . 
+.TP +.B %% +A single +.B % +character; no argument is converted. +.PP +There are optional, additional parameters that may lie between the +.B % +and the control letter: +.TP +.B \- +The expression should be left-justified within its field. +.TP +.I width +The field should be padded to this width. If the number has a leading +zero, then the field will be padded with zeros. +Otherwise it is padded with blanks. +.TP +.BI . prec +A number indicating the maximum width of strings or digits to the right +of the decimal point. +.PP +The dynamic +.I width +and +.I prec +capabilities of the \*(AN C +.B printf() +routines are supported. +A +.B * +in place of either the +.B width +or +.B prec +specifications will cause their values to be taken from +the argument list to +.B printf +or +.BR sprintf() . +.SS Special File Names +.PP +When doing I/O redirection from either +.B print +or +.B printf +into a file, +or via +.B getline +from a file, +.I gawk +recognizes certain special filenames internally. These filenames +allow access to open file descriptors inherited from +.IR gawk 's +parent process (usually the shell). +Other special filenames provide access information about the running +.B gawk +process. +The filenames are: +.TP \w'\fB/dev/stdout\fR'u+1n +.B /dev/pid +Reading this file returns the process ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/ppid +Reading this file returns the parent process ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/pgrpid +Reading this file returns the process group ID of the current process, +in decimal, terminated with a newline. +.TP +.B /dev/user +Reading this file returns a single record terminated with a newline. +The fields are separated with blanks. 
+.B $1 +is the value of the +.IR getuid (2) +system call, +.B $2 +is the value of the +.IR geteuid (2) +system call, +.B $3 +is the value of the +.IR getgid (2) +system call, and +.B $4 +is the value of the +.IR getegid (2) +system call. +If there are any additional fields, they are the group IDs returned by +.IR getgroups (2). +(Multiple groups may not be supported on all systems.) +.TP +.B /dev/stdin +The standard input. +.TP +.B /dev/stdout +The standard output. +.TP +.B /dev/stderr +The standard error output. +.TP +.BI /dev/fd/\^ n +The file associated with the open file descriptor +.IR n . +.PP +These are particularly useful for error messages. For example: +.PP +.RS +.ft B +print "You blew it!" > "/dev/stderr" +.ft R +.RE +.PP +whereas you would otherwise have to use +.PP +.RS +.ft B +print "You blew it!" | "cat 1>&2" +.ft R +.RE +.PP +These file names may also be used on the command line to name data files. +.SS Numeric Functions +.PP +AWK has the following pre-defined arithmetic functions: +.PP +.TP \w'\fBsrand(\^\fIexpr\^\fB)\fR'u+1n +.BI atan2( y , " x" ) +returns the arctangent of +.I y/x +in radians. +.TP +.BI cos( expr ) +returns the cosine in radians. +.TP +.BI exp( expr ) +the exponential function. +.TP +.BI int( expr ) +truncates to integer. +.TP +.BI log( expr ) +the natural logarithm function. +.TP +.B rand() +returns a random number between 0 and 1. +.TP +.BI sin( expr ) +returns the sine in radians. +.TP +.BI sqrt( expr ) +the square root function. +.TP +.BI srand( expr ) +use +.I expr +as a new seed for the random number generator. If no +.I expr +is provided, the time of day will be used. +The return value is the previous seed for the random +number generator. 
+.SS String Functions +.PP +AWK has the following pre-defined string functions: +.PP +.TP "\w'\fBsprintf(\^\fIfmt\fB\^, \fIexpr-list\^\fB)\fR'u+1n" +\fBgsub(\fIr\fB, \fIs\fB, \fIt\fB)\fR +for each substring matching the regular expression +.I r +in the string +.IR t , +substitute the string +.IR s , +and return the number of substitutions. +If +.I t +is not supplied, use +.BR $0 . +.TP +.BI index( s , " t" ) +returns the index of the string +.I t +in the string +.IR s , +or 0 if +.I t +is not present. +.TP +.BI length( s ) +returns the length of the string +.IR s , +or the length of +.B $0 +if +.I s +is not supplied. +.TP +.BI match( s , " r" ) +returns the position in +.I s +where the regular expression +.I r +occurs, or 0 if +.I r +is not present, and sets the values of +.B RSTART +and +.BR RLENGTH . +.TP +\fBsplit(\fIs\fB, \fIa\fB, \fIr\fB)\fR +splits the string +.I s +into the array +.I a +on the regular expression +.IR r , +and returns the number of fields. If +.I r +is omitted, +.B FS +is used instead. +.TP +.BI sprintf( fmt , " expr-list" ) +prints +.I expr-list +according to +.IR fmt , +and returns the resulting string. +.TP +\fBsub(\fIr\fB, \fIs\fB, \fIt\fB)\fR +just like +.BR gsub() , +but only the first matching substring is replaced. +.TP +\fBsubstr(\fIs\fB, \fIi\fB, \fIn\fB)\fR +returns the +.IR n -character +substring of +.I s +starting at +.IR i . +If +.I n +is omitted, the rest of +.I s +is used. +.TP +.BI tolower( str ) +returns a copy of the string +.IR str , +with all the upper-case characters in +.I str +translated to their corresponding lower-case counterparts. +Non-alphabetic characters are left unchanged. +.TP +.BI toupper( str ) +returns a copy of the string +.IR str , +with all the lower-case characters in +.I str +translated to their corresponding upper-case counterparts. +Non-alphabetic characters are left unchanged. 
+.SS Time Functions +.PP +Since one of the primary uses of AWK programs is processing log files +that contain time stamp information, +.I gawk +provides the following two functions for obtaining time stamps and +formatting them. +.PP +.TP "\w'\fBsystime()\fR'u+1n" +.B systime() +returns the current time of day as the number of seconds since the Epoch +(Midnight UTC, January 1, 1970 on \*(PX systems). +.TP +\fBstrftime(\fIformat\fR, \fItimestamp\fB)\fR +formats +.I timestamp +according to the specification in +.IR format. +The +.I timestamp +should be of the same form as returned by +.BR systime() . +If +.I timestamp +is missing, the current time of day is used. +See the specification for the +.B strftime() +function in \*(AN C for the format conversions that are +guaranteed to be available. +A public-domain version of +.IR strftime (3) +and a man page for it are shipped with +.IR gawk ; +if that version was used to build +.IR gawk , +then all of the conversions described in that man page are available to +.IR gawk. +.SS String Constants +.PP +String constants in AWK are sequences of characters enclosed +between double quotes (\fB"\fR). Within strings, certain +.I "escape sequences" +are recognized, as in C. These are: +.PP +.TP \w'\fB\e\^\fIddd\fR'u+1n +.B \e\e +A literal backslash. +.TP +.B \ea +The ``alert'' character; usually the \s-1ASCII\s+1 \s-1BEL\s+1 character. +.TP +.B \eb +backspace. +.TP +.B \ef +form-feed. +.TP +.B \en +new line. +.TP +.B \er +carriage return. +.TP +.B \et +horizontal tab. +.TP +.B \ev +vertical tab. +.TP +.BI \ex "\^hex digits" +The character represented by the string of hexadecimal digits following +the +.BR \ex . +As in \*(AN C, all following hexadecimal digits are considered part of +the escape sequence. +(This feature should tell us something about language design by committee.) +E.g., "\ex1B" is the \s-1ASCII\s+1 \s-1ESC\s+1 (escape) character. 
+.TP +.BI \e ddd +The character represented by the 1-, 2-, or 3-digit sequence of octal +digits. E.g. "\e033" is the \s-1ASCII\s+1 \s-1ESC\s+1 (escape) character. +.TP +.BI \e c +The literal character +.IR c\^ . +.PP +The escape sequences may also be used inside constant regular expressions +(e.g., +.B "/[\ \et\ef\en\er\ev]/" +matches whitespace characters). +.SH FUNCTIONS +Functions in AWK are defined as follows: +.PP +.RS +\fBfunction \fIname\fB(\fIparameter list\fB) { \fIstatements \fB}\fR +.RE +.PP +Functions are executed when called from within the action parts of regular +pattern-action statements. Actual parameters supplied in the function +call are used to instantiate the formal parameters declared in the function. +Arrays are passed by reference, other variables are passed by value. +.PP +Since functions were not originally part of the AWK language, the provision +for local variables is rather clumsy: They are declared as extra parameters +in the parameter list. The convention is to separate local variables from +real parameters by extra spaces in the parameter list. For example: +.PP +.RS +.ft B +.nf +function f(p, q, a, b) { # a & b are local + ..... } + +/abc/ { ... ; f(1, 2) ; ... } +.fi +.ft R +.RE +.PP +The left parenthesis in a function call is required +to immediately follow the function name, +without any intervening white space. +This is to avoid a syntactic ambiguity with the concatenation operator. +This restriction does not apply to the built-in functions listed above. +.PP +Functions may call each other and may be recursive. +Function parameters used as local variables are initialized +to the null string and the number zero upon function invocation. +.PP +The word +.B func +may be used in place of +.BR function . 
+.SH EXAMPLES +.nf +Print and sort the login names of all users: + +.ft B + BEGIN { FS = ":" } + { print $1 | "sort" } + +.ft R +Count lines in a file: + +.ft B + { nlines++ } + END { print nlines } + +.ft R +Precede each line by its number in the file: + +.ft B + { print FNR, $0 } + +.ft R +Concatenate and line number (a variation on a theme): + +.ft B + { print NR, $0 } +.ft R +.fi +.SH SEE ALSO +.IR egrep (1) +.PP +.IR "The AWK Programming Language" , +Alfred V. Aho, Brian W. Kernighan, Peter J. Weinberger, +Addison-Wesley, 1988. ISBN 0-201-07981-X. +.PP +.IR "The GAWK Manual" , +Edition 0.15, published by the Free Software Foundation, 1993. +.SH POSIX COMPATIBILITY +A primary goal for +.I gawk +is compatibility with the \*(PX standard, as well as with the +latest version of \*(UX +.IR awk . +To this end, +.I gawk +incorporates the following user visible +features which are not described in the AWK book, +but are part of +.I awk +in System V Release 4, and are in the \*(PX standard. +.PP +The +.B \-v +option for assigning variables before program execution starts is new. +The book indicates that command line variable assignment happens when +.I awk +would otherwise open the argument as a file, which is after the +.B BEGIN +block is executed. However, in earlier implementations, when such an +assignment appeared before any file names, the assignment would happen +.I before +the +.B BEGIN +block was run. Applications came to depend on this ``feature.'' +When +.I awk +was changed to match its documentation, this option was added to +accomodate applications that depended upon the old behavior. +(This feature was agreed upon by both the AT&T and GNU developers.) +.PP +The +.B \-W +option for implementation specific features is from the \*(PX standard. +.PP +When processing arguments, +.I gawk +uses the special option ``\fB\-\^\-\fP'' to signal the end of +arguments, and warns about, but otherwise ignores, undefined options. 
+.PP +The AWK book does not define the return value of +.BR srand() . +The System V Release 4 version of \*(UX +.I awk +(and the \*(PX standard) +has it return the seed it was using, to allow keeping track +of random number sequences. Therefore +.B srand() +in +.I gawk +also returns its current seed. +.PP +Other new features are: +The use of multiple +.B \-f +options (from MKS +.IR awk ); +the +.B ENVIRON +array; the +.BR \ea , +and +.BR \ev +escape sequences (done originally in +.I gawk +and fed back into AT&T's); the +.B tolower() +and +.B toupper() +built-in functions (from AT&T); and the \*(AN C conversion specifications in +.B printf +(done first in AT&T's version). +.SH GNU EXTENSIONS +.I Gawk +has some extensions to \*(PX +.IR awk . +They are described in this section. All the extensions described here +can be disabled by +invoking +.I gawk +with the +.B "\-W compat" +option. +.PP +The following features of +.I gawk +are not available in +\*(PX +.IR awk . +.RS +.TP \w'\(bu'u+1n +\(bu +The +.B \ex +escape sequence. +.TP +\(bu +The +.B systime() +and +.B strftime() +functions. +.TP +\(bu +The special file names available for I/O redirection are not recognized. +.TP +\(bu +The +.B ARGIND +and +.B ERRNO +variables are not special. +.TP +\(bu +The +.B IGNORECASE +variable and its side-effects are not available. +.TP +\(bu +The +.B FIELDWIDTHS +variable and fixed width field splitting. +.TP +\(bu +No path search is performed for files named via the +.B \-f +option. Therefore the +.B AWKPATH +environment variable is not special. +.TP +\(bu +The use of +.B "next file" +to abandon processing of the current input file. +.RE +.PP +The AWK book does not define the return value of the +.B close() +function. +.IR Gawk\^ 's +.B close() +returns the value from +.IR fclose (3), +or +.IR pclose (3), +when closing a file or pipe, respectively. 
+.PP +When +.I gawk +is invoked with the +.B "\-W compat" +option, +if the +.I fs +argument to the +.B \-F +option is ``t'', then +.B FS +will be set to the tab character. +Since this is a rather ugly special case, it is not the default behavior. +This behavior also does not occur if +.B \-Wposix +has been specified. +.ig +.PP +If +.I gawk +was compiled for debugging, it will +accept the following additional options: +.TP +.PD 0 +.B \-Wparsedebug +.TP +.PD +.B \-\^\-parsedebug +Turn on +.IR yacc (1) +or +.IR bison (1) +debugging output during program parsing. +This option should only be of interest to the +.I gawk +maintainers, and may not even be compiled into +.IR gawk . +.. +.SH HISTORICAL FEATURES +There are two features of historical AWK implementations that +.I gawk +supports. +First, it is possible to call the +.B length() +built-in function not only with no argument, but even without parentheses! +Thus, +.RS +.PP +.ft B +a = length +.ft R +.RE +.PP +is the same as either of +.RS +.PP +.ft B +a = length() +.br +a = length($0) +.ft R +.RE +.PP +This feature is marked as ``deprecated'' in the \*(PX standard, and +.I gawk +will issue a warning about its use if +.B \-Wlint +is specified on the command line. +.PP +The other feature is the use of the +.B continue +statement outside the body of a +.BR while , +.BR for , +or +.B do +loop. Traditional AWK implementations have treated such usage as +equivalent to the +.B next +statement. +.I Gawk +will support this usage if +.B \-Wposix +has not been specified. +.SH BUGS +The +.B \-F +option is not necessary given the command line variable assignment feature; +it remains only for backwards compatibility. +.PP +If your system actually has support for +.B /dev/fd +and the associated +.BR /dev/stdin , +.BR /dev/stdout , +and +.B /dev/stderr +files, you may get different output from +.I gawk +than you would get on a system without those files. 
When +.I gawk +interprets these files internally, it synchronizes output to the standard +output with output to +.BR /dev/stdout , +while on a system with those files, the output is actually to different +open files. +Caveat Emptor. +.SH VERSION INFORMATION +This man page documents +.IR gawk , +version 2.15. +.PP +Starting with the 2.15 version of +.IR gawk , +the +.BR \-c , +.BR \-V , +.BR \-C , +.ig +.BR \-D , +.. +.BR \-a , +and +.B \-e +options of the 2.11 version are no longer recognized. +.SH AUTHORS +The original version of \*(UX +.I awk +was designed and implemented by Alfred Aho, +Peter Weinberger, and Brian Kernighan of AT&T Bell Labs. Brian Kernighan +continues to maintain and enhance it. +.PP +Paul Rubin and Jay Fenlason, +of the Free Software Foundation, wrote +.IR gawk , +to be compatible with the original version of +.I awk +distributed in Seventh Edition \*(UX. +John Woods contributed a number of bug fixes. +David Trueman, with contributions +from Arnold Robbins, made +.I gawk +compatible with the new version of \*(UX +.IR awk . +.PP +The initial DOS port was done by Conrad Kwok and Scott Garfinkle. +Scott Deifik is the current DOS maintainer. Pat Rankin did the +port to VMS, and Michal Jaegermann did the port to the Atari ST. +.SH ACKNOWLEDGEMENTS +Brian Kernighan of Bell Labs +provided valuable assistance during testing and debugging. +We thank him. diff --git a/gnu/usr.bin/awk/awk.h b/gnu/usr.bin/awk/awk.h new file mode 100644 index 000000000000..ca3997f11d4b --- /dev/null +++ b/gnu/usr.bin/awk/awk.h @@ -0,0 +1,763 @@ +/* + * awk.h -- Definitions for gawk. + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. 
+ * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +/* ------------------------------ Includes ------------------------------ */ +#include <stdio.h> +#include <limits.h> +#include <ctype.h> +#include <setjmp.h> +#include <varargs.h> +#include <time.h> +#include <errno.h> +#if !defined(errno) && !defined(MSDOS) +extern int errno; +#endif +#ifdef __GNU_LIBRARY__ +#ifndef linux +#include <signum.h> +#endif +#endif + +/* ----------------- System dependencies (with more includes) -----------*/ + +#if !defined(VMS) || (!defined(VAXC) && !defined(__DECC)) +#include <sys/types.h> +#include <sys/stat.h> +#else /* VMS w/ VAXC or DECC */ +#include <types.h> +#include <stat.h> +#include <file.h> /* avoid <fcntl.h> in io.c */ +#endif + +#include <signal.h> + +#include "config.h" + +#ifdef __STDC__ +#define P(s) s +#define MALLOC_ARG_T size_t +#else +#define P(s) () +#define MALLOC_ARG_T unsigned +#define volatile +#define const +#endif + +#ifndef SIGTYPE +#define SIGTYPE void +#endif + +#ifdef SIZE_T_MISSING +typedef unsigned int size_t; +#endif + +#ifndef SZTC +#define SZTC +#define INTC +#endif + +#ifdef STDC_HEADERS +#include <stdlib.h> +#include <string.h> +#ifdef NeXT +#include <libc.h> +#undef atof +#else +#if defined(atarist) || defined(VMS) +#include <unixlib.h> +#else /* atarist || VMS */ +#ifndef MSDOS +#include 
<unistd.h> +#endif /* MSDOS */ +#endif /* atarist || VMS */ +#endif /* Next */ +#else /* STDC_HEADERS */ +#include "protos.h" +#endif /* STDC_HEADERS */ + +#if defined(ultrix) && !defined(Ultrix41) +extern char * getenv P((char *name)); +extern double atof P((char *s)); +#endif + +#ifndef __GNUC__ +#ifdef sparc +/* nasty nasty SunOS-ism */ +#include <alloca.h> +#ifdef lint +extern char *alloca(); +#endif +#else /* not sparc */ +#if !defined(alloca) && !defined(ALLOCA_PROTO) +extern char *alloca(); +#endif +#endif /* sparc */ +#endif /* __GNUC__ */ + +#ifdef HAVE_UNDERSCORE_SETJMP +/* nasty nasty berkelixm */ +#define setjmp _setjmp +#define longjmp _longjmp +#endif + +/* + * if you don't have vprintf, try this and cross your fingers. + */ +#if defined(VPRINTF_MISSING) +#define vfprintf(fp,fmt,arg) _doprnt((fmt), (arg), (fp)) +#endif + +#ifdef VMS +/* some macros to redirect to code in vms/vms_misc.c */ +#define exit vms_exit +#define open vms_open +#define strerror vms_strerror +#define strdup vms_strdup +extern void exit P((int)); +extern int open P((const char *,int,...)); +extern char *strerror P((int)); +extern char *strdup P((const char *str)); +extern int vms_devopen P((const char *,int)); +# ifndef NO_TTY_FWRITE +#define fwrite tty_fwrite +#define fclose tty_fclose +extern size_t fwrite P((const void *,size_t,size_t,FILE *)); +extern int fclose P((FILE *)); +# endif +extern FILE *popen P((const char *,const char *)); +extern int pclose P((FILE *)); +extern void vms_arg_fixup P((int *,char ***)); +/* some things not in STDC_HEADERS */ +extern int gnu_strftime P((char *,size_t,const char *,const struct tm *)); +extern int unlink P((const char *)); +extern int getopt P((int,char **,char *)); +extern int isatty P((int)); +#ifndef fileno +extern int fileno P((FILE *)); +#endif +extern int close(), dup(), dup2(), fstat(), read(), stat(); +#endif /*VMS*/ + +#ifdef MSDOS +#include <io.h> +extern FILE *popen P((char *, char *)); +extern int pclose P((FILE *)); 
+#endif + +#define GNU_REGEX +#ifdef GNU_REGEX +#include "regex.h" +#include "dfa.h" +typedef struct Regexp { + struct re_pattern_buffer pat; + struct re_registers regs; + struct regexp dfareg; + int dfa; +} Regexp; +#define RESTART(rp,s) (rp)->regs.start[0] +#define REEND(rp,s) (rp)->regs.end[0] +#else /* GNU_REGEX */ +#endif /* GNU_REGEX */ + +#ifdef atarist +#define read _text_read /* we do not want all these CR's to mess our input */ +extern int _text_read (int, char *, int); +#endif + +#ifndef DEFPATH +#define DEFPATH ".:/usr/local/lib/awk:/usr/lib/awk" +#endif + +#ifndef ENVSEP +#define ENVSEP ':' +#endif + +/* ------------------ Constants, Structures, Typedefs ------------------ */ +#define AWKNUM double + +typedef enum { + /* illegal entry == 0 */ + Node_illegal, + + /* binary operators lnode and rnode are the expressions to work on */ + Node_times, + Node_quotient, + Node_mod, + Node_plus, + Node_minus, + Node_cond_pair, /* conditional pair (see Node_line_range) */ + Node_subscript, + Node_concat, + Node_exp, + + /* unary operators subnode is the expression to work on */ +/*10*/ Node_preincrement, + Node_predecrement, + Node_postincrement, + Node_postdecrement, + Node_unary_minus, + Node_field_spec, + + /* assignments lnode is the var to assign to, rnode is the exp */ + Node_assign, + Node_assign_times, + Node_assign_quotient, + Node_assign_mod, +/*20*/ Node_assign_plus, + Node_assign_minus, + Node_assign_exp, + + /* boolean binaries lnode and rnode are expressions */ + Node_and, + Node_or, + + /* binary relationals compares lnode and rnode */ + Node_equal, + Node_notequal, + Node_less, + Node_greater, + Node_leq, +/*30*/ Node_geq, + Node_match, + Node_nomatch, + + /* unary relationals works on subnode */ + Node_not, + + /* program structures */ + Node_rule_list, /* lnode is a rule, rnode is rest of list */ + Node_rule_node, /* lnode is pattern, rnode is statement */ + Node_statement_list, /* lnode is statement, rnode is more list */ + Node_if_branches, /* 
lnode is to run on true, rnode on false */ + Node_expression_list, /* lnode is an exp, rnode is more list */ + Node_param_list, /* lnode is a variable, rnode is more list */ + + /* keywords */ +/*40*/ Node_K_if, /* lnode is conditonal, rnode is if_branches */ + Node_K_while, /* lnode is condtional, rnode is stuff to run */ + Node_K_for, /* lnode is for_struct, rnode is stuff to run */ + Node_K_arrayfor, /* lnode is for_struct, rnode is stuff to run */ + Node_K_break, /* no subs */ + Node_K_continue, /* no stuff */ + Node_K_print, /* lnode is exp_list, rnode is redirect */ + Node_K_printf, /* lnode is exp_list, rnode is redirect */ + Node_K_next, /* no subs */ + Node_K_exit, /* subnode is return value, or NULL */ +/*50*/ Node_K_do, /* lnode is conditional, rnode stuff to run */ + Node_K_return, + Node_K_delete, + Node_K_getline, + Node_K_function, /* lnode is statement list, rnode is params */ + + /* I/O redirection for print statements */ + Node_redirect_output, /* subnode is where to redirect */ + Node_redirect_append, /* subnode is where to redirect */ + Node_redirect_pipe, /* subnode is where to redirect */ + Node_redirect_pipein, /* subnode is where to redirect */ + Node_redirect_input, /* subnode is where to redirect */ + + /* Variables */ +/*60*/ Node_var, /* rnode is value, lnode is array stuff */ + Node_var_array, /* array is ptr to elements, asize num of + * eles */ + Node_val, /* node is a value - type in flags */ + + /* Builtins subnode is explist to work on, proc is func to call */ + Node_builtin, + + /* + * pattern: conditional ',' conditional ; lnode of Node_line_range + * is the two conditionals (Node_cond_pair), other word (rnode place) + * is a flag indicating whether or not this range has been entered. + */ + Node_line_range, + + /* + * boolean test of membership in array lnode is string-valued + * expression rnode is array name + */ + Node_in_array, + + Node_func, /* lnode is param. 
list, rnode is body */ + Node_func_call, /* lnode is name, rnode is argument list */ + + Node_cond_exp, /* lnode is conditonal, rnode is if_branches */ + Node_regex, +/*70*/ Node_hashnode, + Node_ahash, + Node_NF, + Node_NR, + Node_FNR, + Node_FS, + Node_RS, + Node_FIELDWIDTHS, + Node_IGNORECASE, + Node_OFS, + Node_ORS, + Node_OFMT, + Node_CONVFMT, + Node_K_nextfile +} NODETYPE; + +/* + * NOTE - this struct is a rather kludgey -- it is packed to minimize + * space usage, at the expense of cleanliness. Alter at own risk. + */ +typedef struct exp_node { + union { + struct { + union { + struct exp_node *lptr; + char *param_name; + } l; + union { + struct exp_node *rptr; + struct exp_node *(*pptr) (); + Regexp *preg; + struct for_loop_header *hd; + struct exp_node **av; + int r_ent; /* range entered */ + } r; + union { + char *name; + struct exp_node *extra; + } x; + short number; + unsigned char reflags; +# define CASE 1 +# define CONST 2 +# define FS_DFLT 4 + } nodep; + struct { + AWKNUM fltnum; /* this is here for optimal packing of + * the structure on many machines + */ + char *sp; + size_t slen; + unsigned char sref; + char idx; + } val; + struct { + struct exp_node *next; + char *name; + int length; + struct exp_node *value; + } hash; +#define hnext sub.hash.next +#define hname sub.hash.name +#define hlength sub.hash.length +#define hvalue sub.hash.value + struct { + struct exp_node *next; + struct exp_node *name; + struct exp_node *value; + } ahash; +#define ahnext sub.ahash.next +#define ahname sub.ahash.name +#define ahvalue sub.ahash.value + } sub; + NODETYPE type; + unsigned short flags; +# define MALLOC 1 /* can be free'd */ +# define TEMP 2 /* should be free'd */ +# define PERM 4 /* can't be free'd */ +# define STRING 8 /* assigned as string */ +# define STR 16 /* string value is current */ +# define NUM 32 /* numeric value is current */ +# define NUMBER 64 /* assigned as number */ +# define MAYBE_NUM 128 /* user input: if NUMERIC then + * a NUMBER + */ + 
char *vname; /* variable's name */ +} NODE; + +#define lnode sub.nodep.l.lptr +#define nextp sub.nodep.l.lptr +#define rnode sub.nodep.r.rptr +#define source_file sub.nodep.x.name +#define source_line sub.nodep.number +#define param_cnt sub.nodep.number +#define param sub.nodep.l.param_name + +#define subnode lnode +#define proc sub.nodep.r.pptr + +#define re_reg sub.nodep.r.preg +#define re_flags sub.nodep.reflags +#define re_text lnode +#define re_exp sub.nodep.x.extra +#define re_cnt sub.nodep.number + +#define forsub lnode +#define forloop rnode->sub.nodep.r.hd + +#define stptr sub.val.sp +#define stlen sub.val.slen +#define stref sub.val.sref +#define stfmt sub.val.idx + +#define numbr sub.val.fltnum + +#define var_value lnode +#define var_array sub.nodep.r.av + +#define condpair lnode +#define triggered sub.nodep.r.r_ent + +#ifdef DONTDEF +int primes[] = {31, 61, 127, 257, 509, 1021, 2053, 4099, 8191, 16381}; +#endif +/* a quick profile suggests that the following is a good value */ +#define HASHSIZE 127 + +typedef struct for_loop_header { + NODE *init; + NODE *cond; + NODE *incr; +} FOR_LOOP_HEADER; + +/* for "for(iggy in foo) {" */ +struct search { + NODE **arr_ptr; + NODE **arr_end; + NODE *bucket; + NODE *retval; +}; + +/* for faster input, bypass stdio */ +typedef struct iobuf { + int fd; + char *buf; + char *off; + char *end; + size_t size; /* this will be determined by an fstat() call */ + int cnt; + long secsiz; + int flag; +# define IOP_IS_TTY 1 +# define IOP_IS_INTERNAL 2 +# define IOP_NO_FREE 4 +} IOBUF; + +typedef void (*Func_ptr)(); + +/* + * structure used to dynamically maintain a linked-list of open files/pipes + */ +struct redirect { + unsigned int flag; +# define RED_FILE 1 +# define RED_PIPE 2 +# define RED_READ 4 +# define RED_WRITE 8 +# define RED_APPEND 16 +# define RED_NOBUF 32 +# define RED_USED 64 +# define RED_EOF 128 + char *value; + FILE *fp; + IOBUF *iop; + int pid; + int status; + struct redirect *prev; + struct redirect *next; 
+}; + +/* structure for our source, either a command line string or a source file */ +struct src { + enum srctype { CMDLINE = 1, SOURCEFILE } stype; + char *val; +}; + +/* longjmp return codes, must be nonzero */ +/* Continue means either for loop/while continue, or next input record */ +#define TAG_CONTINUE 1 +/* Break means either for/while break, or stop reading input */ +#define TAG_BREAK 2 +/* Return means return from a function call; leave value in ret_node */ +#define TAG_RETURN 3 + +#define HUGE INT_MAX + +/* -------------------------- External variables -------------------------- */ +/* gawk builtin variables */ +extern int NF; +extern int NR; +extern int FNR; +extern int IGNORECASE; +extern char *RS; +extern char *OFS; +extern int OFSlen; +extern char *ORS; +extern int ORSlen; +extern char *OFMT; +extern char *CONVFMT; +extern int CONVFMTidx; +extern int OFMTidx; +extern NODE *FS_node, *NF_node, *RS_node, *NR_node; +extern NODE *FILENAME_node, *OFS_node, *ORS_node, *OFMT_node; +extern NODE *CONVFMT_node; +extern NODE *FNR_node, *RLENGTH_node, *RSTART_node, *SUBSEP_node; +extern NODE *IGNORECASE_node; +extern NODE *FIELDWIDTHS_node; + +extern NODE **stack_ptr; +extern NODE *Nnull_string; +extern NODE **fields_arr; +extern int sourceline; +extern char *source; +extern NODE *expression_value; + +extern NODE *_t; /* used as temporary in tree_eval */ + +extern const char *myname; + +extern NODE *nextfree; +extern int field0_valid; +extern int do_unix; +extern int do_posix; +extern int do_lint; +extern int in_begin_rule; +extern int in_end_rule; + +/* ------------------------- Pseudo-functions ------------------------- */ + +#define is_identchar(c) (isalnum(c) || (c) == '_') + + +#ifndef MPROF +#define getnode(n) if (nextfree) n = nextfree, nextfree = nextfree->nextp;\ + else n = more_nodes() +#define freenode(n) ((n)->nextp = nextfree, nextfree = (n)) +#else +#define getnode(n) emalloc(n, NODE *, sizeof(NODE), "getnode") +#define freenode(n) free(n) +#endif + 
+#ifdef DEBUG +#define tree_eval(t) r_tree_eval(t) +#else +#define tree_eval(t) (_t = (t),(_t) == NULL ? Nnull_string : \ + ((_t)->type == Node_val ? (_t) : \ + ((_t)->type == Node_var ? (_t)->var_value : \ + ((_t)->type == Node_param_list ? \ + (stack_ptr[(_t)->param_cnt])->var_value : \ + r_tree_eval((_t)))))) +#endif + +#define make_number(x) mk_number((x), (MALLOC|NUM|NUMBER)) +#define tmp_number(x) mk_number((x), (MALLOC|TEMP|NUM|NUMBER)) + +#define free_temp(n) do {if ((n)->flags&TEMP) { unref(n); }} while (0) +#define make_string(s,l) make_str_node((s), SZTC (l),0) +#define SCAN 1 +#define ALREADY_MALLOCED 2 + +#define cant_happen() fatal("internal error line %d, file: %s", \ + __LINE__, __FILE__); + +#if defined(__STDC__) && !defined(NO_TOKEN_PASTING) +#define emalloc(var,ty,x,str) (void)((var=(ty)malloc((MALLOC_ARG_T)(x))) ||\ + (fatal("%s: %s: can't allocate memory (%s)",\ + (str), #var, strerror(errno)),0)) +#define erealloc(var,ty,x,str) (void)((var=(ty)realloc((char *)var,\ + (MALLOC_ARG_T)(x))) ||\ + (fatal("%s: %s: can't allocate memory (%s)",\ + (str), #var, strerror(errno)),0)) +#else /* __STDC__ */ +#define emalloc(var,ty,x,str) (void)((var=(ty)malloc((MALLOC_ARG_T)(x))) ||\ + (fatal("%s: %s: can't allocate memory (%s)",\ + (str), "var", strerror(errno)),0)) +#define erealloc(var,ty,x,str) (void)((var=(ty)realloc((char *)var,\ + (MALLOC_ARG_T)(x))) ||\ + (fatal("%s: %s: can't allocate memory (%s)",\ + (str), "var", strerror(errno)),0)) +#endif /* __STDC__ */ + +#ifdef DEBUG +#define force_number r_force_number +#define force_string r_force_string +#else /* not DEBUG */ +#ifdef lint +extern AWKNUM force_number(); +#endif +#ifdef MSDOS +extern double _msc51bug; +#define force_number(n) (_msc51bug=(_t = (n),(_t->flags & NUM) ? _t->numbr : r_force_number(_t))) +#else /* not MSDOS */ +#define force_number(n) (_t = (n),(_t->flags & NUM) ? _t->numbr : r_force_number(_t)) +#endif /* MSDOS */ +#define force_string(s) (_t = (s),(_t->flags & STR) ? 
_t : r_force_string(_t)) +#endif /* not DEBUG */ + +#define STREQ(a,b) (*(a) == *(b) && strcmp((a), (b)) == 0) +#define STREQN(a,b,n) ((n)&& *(a)== *(b) && strncmp((a), (b), SZTC (n)) == 0) + +/* ------------- Function prototypes or defs (as appropriate) ------------- */ + +/* array.c */ +extern NODE *concat_exp P((NODE *tree)); +extern void assoc_clear P((NODE *symbol)); +extern unsigned int hash P((char *s, int len)); +extern int in_array P((NODE *symbol, NODE *subs)); +extern NODE **assoc_lookup P((NODE *symbol, NODE *subs)); +extern void do_delete P((NODE *symbol, NODE *tree)); +extern void assoc_scan P((NODE *symbol, struct search *lookat)); +extern void assoc_next P((struct search *lookat)); +/* awk.tab.c */ +extern char *tokexpand P((void)); +extern char nextc P((void)); +extern NODE *node P((NODE *left, NODETYPE op, NODE *right)); +extern NODE *install P((char *name, NODE *value)); +extern NODE *lookup P((char *name)); +extern NODE *variable P((char *name, int can_free)); +extern int yyparse P((void)); +/* builtin.c */ +extern NODE *do_exp P((NODE *tree)); +extern NODE *do_index P((NODE *tree)); +extern NODE *do_int P((NODE *tree)); +extern NODE *do_length P((NODE *tree)); +extern NODE *do_log P((NODE *tree)); +extern NODE *do_sprintf P((NODE *tree)); +extern void do_printf P((NODE *tree)); +extern void print_simple P((NODE *tree, FILE *fp)); +extern NODE *do_sqrt P((NODE *tree)); +extern NODE *do_substr P((NODE *tree)); +extern NODE *do_strftime P((NODE *tree)); +extern NODE *do_systime P((NODE *tree)); +extern NODE *do_system P((NODE *tree)); +extern void do_print P((NODE *tree)); +extern NODE *do_tolower P((NODE *tree)); +extern NODE *do_toupper P((NODE *tree)); +extern NODE *do_atan2 P((NODE *tree)); +extern NODE *do_sin P((NODE *tree)); +extern NODE *do_cos P((NODE *tree)); +extern NODE *do_rand P((NODE *tree)); +extern NODE *do_srand P((NODE *tree)); +extern NODE *do_match P((NODE *tree)); +extern NODE *do_gsub P((NODE *tree)); +extern NODE *do_sub 
P((NODE *tree)); +/* eval.c */ +extern int interpret P((NODE *volatile tree)); +extern NODE *r_tree_eval P((NODE *tree)); +extern int cmp_nodes P((NODE *t1, NODE *t2)); +extern NODE **get_lhs P((NODE *ptr, Func_ptr *assign)); +extern void set_IGNORECASE P((void)); +void set_OFS P((void)); +void set_ORS P((void)); +void set_OFMT P((void)); +void set_CONVFMT P((void)); +/* field.c */ +extern void init_fields P((void)); +extern void set_record P((char *buf, int cnt, int freeold)); +extern void reset_record P((void)); +extern void set_NF P((void)); +extern NODE **get_field P((int num, Func_ptr *assign)); +extern NODE *do_split P((NODE *tree)); +extern void set_FS P((void)); +extern void set_RS P((void)); +extern void set_FIELDWIDTHS P((void)); +/* io.c */ +extern void set_FNR P((void)); +extern void set_NR P((void)); +extern void do_input P((void)); +extern struct redirect *redirect P((NODE *tree, int *errflg)); +extern NODE *do_close P((NODE *tree)); +extern int flush_io P((void)); +extern int close_io P((void)); +extern int devopen P((char *name, char *mode)); +extern int pathopen P((char *file)); +extern NODE *do_getline P((NODE *tree)); +extern void do_nextfile P((void)); +/* iop.c */ +extern int optimal_bufsize P((int fd)); +extern IOBUF *iop_alloc P((int fd)); +extern int get_a_record P((char **out, IOBUF *iop, int rs, int *errcode)); +/* main.c */ +extern int main P((int argc, char **argv)); +extern Regexp *mk_re_parse P((char *s, int ignorecase)); +extern void load_environ P((void)); +extern char *arg_assign P((char *arg)); +extern SIGTYPE catchsig P((int sig, int code)); +/* msg.c */ +#ifdef MSDOS +extern void err P((char *s, char *emsg, char *va_list, ...)); +extern void msg P((char *va_alist, ...)); +extern void warning P((char *va_alist, ...)); +extern void fatal P((char *va_alist, ...)); +#else +extern void err (); +extern void msg (); +extern void warning (); +extern void fatal (); +#endif +/* node.c */ +extern AWKNUM r_force_number P((NODE *n)); +extern 
NODE *r_force_string P((NODE *s)); +extern NODE *dupnode P((NODE *n)); +extern NODE *mk_number P((AWKNUM x, unsigned int flags)); +extern NODE *make_str_node P((char *s, size_t len, int scan )); +extern NODE *tmp_string P((char *s, size_t len )); +extern NODE *more_nodes P((void)); +#ifdef DEBUG +extern void freenode P((NODE *it)); +#endif +extern void unref P((NODE *tmp)); +extern int parse_escape P((char **string_ptr)); +/* re.c */ +extern Regexp *make_regexp P((char *s, int len, int ignorecase, int dfa)); +extern int research P((Regexp *rp, char *str, int start, int len, int need_start)); +extern void refree P((Regexp *rp)); +extern void reg_error P((const char *s)); +extern Regexp *re_update P((NODE *t)); +extern void resyntax P((int syntax)); +extern void resetup P((void)); + +/* strcase.c */ +extern int strcasecmp P((const char *s1, const char *s2)); +extern int strncasecmp P((const char *s1, const char *s2, register size_t n)); + +#ifdef atarist +/* atari/tmpnam.c */ +extern char *tmpnam P((char *buf)); +extern char *tempnam P((const char *path, const char *base)); +#endif + +/* Figure out what '\a' really is. */ +#ifdef __STDC__ +#define BELL '\a' /* sure makes life easy, don't it? */ +#else +# if 'z' - 'a' == 25 /* ascii */ +# if 'a' != 97 /* machine is dumb enough to use mark parity */ +# define BELL '\207' +# else +# define BELL '\07' +# endif +# else +# define BELL '\057' +# endif +#endif + +extern char casetable[]; /* for case-independent regexp matching */ diff --git a/gnu/usr.bin/awk/awk.y b/gnu/usr.bin/awk/awk.y new file mode 100644 index 000000000000..6e87f1c449cc --- /dev/null +++ b/gnu/usr.bin/awk/awk.y @@ -0,0 +1,1804 @@ +/* + * awk.y --- yacc/bison parser + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. 
+ * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +%{ +#ifdef DEBUG +#define YYDEBUG 12 +#endif + +#include "awk.h" + +static void yyerror (); /* va_alist */ +static char *get_src_buf P((void)); +static int yylex P((void)); +static NODE *node_common P((NODETYPE op)); +static NODE *snode P((NODE *subn, NODETYPE op, int sindex)); +static NODE *mkrangenode P((NODE *cpair)); +static NODE *make_for_loop P((NODE *init, NODE *cond, NODE *incr)); +static NODE *append_right P((NODE *list, NODE *new)); +static void func_install P((NODE *params, NODE *def)); +static void pop_var P((NODE *np, int freeit)); +static void pop_params P((NODE *params)); +static NODE *make_param P((char *name)); +static NODE *mk_rexp P((NODE *exp)); + +static int want_assign; /* lexical scanning kludge */ +static int want_regexp; /* lexical scanning kludge */ +static int can_return; /* lexical scanning kludge */ +static int io_allowed = 1; /* lexical scanning kludge */ +static char *lexptr; /* pointer to next char during parsing */ +static char *lexend; +static char *lexptr_begin; /* keep track of where we were for error msgs */ +static char *lexeme; /* beginning of lexeme for debugging */ +static char *thisline = NULL; +#define YYDEBUG_LEXER_TEXT (lexeme) +static int param_counter; +static char *tokstart = NULL; +static char *token = NULL; 
+static char *tokend; + +NODE *variables[HASHSIZE]; + +extern char *source; +extern int sourceline; +extern struct src *srcfiles; +extern int numfiles; +extern int errcount; +extern NODE *begin_block; +extern NODE *end_block; +%} + +%union { + long lval; + AWKNUM fval; + NODE *nodeval; + NODETYPE nodetypeval; + char *sval; + NODE *(*ptrval)(); +} + +%type <nodeval> function_prologue function_body +%type <nodeval> rexp exp start program rule simp_exp +%type <nodeval> non_post_simp_exp +%type <nodeval> pattern +%type <nodeval> action variable param_list +%type <nodeval> rexpression_list opt_rexpression_list +%type <nodeval> expression_list opt_expression_list +%type <nodeval> statements statement if_statement opt_param_list +%type <nodeval> opt_exp opt_variable regexp +%type <nodeval> input_redir output_redir +%type <nodetypeval> print +%type <sval> func_name +%type <lval> lex_builtin + +%token <sval> FUNC_CALL NAME REGEXP +%token <lval> ERROR +%token <nodeval> YNUMBER YSTRING +%token <nodetypeval> RELOP APPEND_OP +%token <nodetypeval> ASSIGNOP MATCHOP NEWLINE CONCAT_OP +%token <nodetypeval> LEX_BEGIN LEX_END LEX_IF LEX_ELSE LEX_RETURN LEX_DELETE +%token <nodetypeval> LEX_WHILE LEX_DO LEX_FOR LEX_BREAK LEX_CONTINUE +%token <nodetypeval> LEX_PRINT LEX_PRINTF LEX_NEXT LEX_EXIT LEX_FUNCTION +%token <nodetypeval> LEX_GETLINE +%token <nodetypeval> LEX_IN +%token <lval> LEX_AND LEX_OR INCREMENT DECREMENT +%token <lval> LEX_BUILTIN LEX_LENGTH + +/* these are just yylval numbers */ + +/* Lowest to highest */ +%right ASSIGNOP +%right '?' ':' +%left LEX_OR +%left LEX_AND +%left LEX_GETLINE +%nonassoc LEX_IN +%left FUNC_CALL LEX_BUILTIN LEX_LENGTH +%nonassoc MATCHOP +%nonassoc RELOP '<' '>' '|' APPEND_OP +%left CONCAT_OP +%left YSTRING YNUMBER +%left '+' '-' +%left '*' '/' '%' +%right '!' 
UNARY +%right '^' +%left INCREMENT DECREMENT +%left '$' +%left '(' ')' +%% + +start + : opt_nls program opt_nls + { expression_value = $2; } + ; + +program + : rule + { + if ($1 != NULL) + $$ = $1; + else + $$ = NULL; + yyerrok; + } + | program rule + /* add the rule to the tail of list */ + { + if ($2 == NULL) + $$ = $1; + else if ($1 == NULL) + $$ = $2; + else { + if ($1->type != Node_rule_list) + $1 = node($1, Node_rule_list, + (NODE*)NULL); + $$ = append_right ($1, + node($2, Node_rule_list,(NODE *) NULL)); + } + yyerrok; + } + | error { $$ = NULL; } + | program error { $$ = NULL; } + ; + +rule + : LEX_BEGIN { io_allowed = 0; } + action + { + if (begin_block) { + if (begin_block->type != Node_rule_list) + begin_block = node(begin_block, Node_rule_list, + (NODE *)NULL); + (void) append_right (begin_block, node( + node((NODE *)NULL, Node_rule_node, $3), + Node_rule_list, (NODE *)NULL) ); + } else + begin_block = node((NODE *)NULL, Node_rule_node, $3); + $$ = NULL; + io_allowed = 1; + yyerrok; + } + | LEX_END { io_allowed = 0; } + action + { + if (end_block) { + if (end_block->type != Node_rule_list) + end_block = node(end_block, Node_rule_list, + (NODE *)NULL); + (void) append_right (end_block, node( + node((NODE *)NULL, Node_rule_node, $3), + Node_rule_list, (NODE *)NULL)); + } else + end_block = node((NODE *)NULL, Node_rule_node, $3); + $$ = NULL; + io_allowed = 1; + yyerrok; + } + | LEX_BEGIN statement_term + { + warning("BEGIN blocks must have an action part"); + errcount++; + yyerrok; + } + | LEX_END statement_term + { + warning("END blocks must have an action part"); + errcount++; + yyerrok; + } + | pattern action + { $$ = node ($1, Node_rule_node, $2); yyerrok; } + | action + { $$ = node ((NODE *)NULL, Node_rule_node, $1); yyerrok; } + | pattern statement_term + { + $$ = node ($1, + Node_rule_node, + node(node(node(make_number(0.0), + Node_field_spec, + (NODE *) NULL), + Node_expression_list, + (NODE *) NULL), + Node_K_print, + (NODE *) NULL)); + yyerrok; 
+ } + | function_prologue function_body + { + func_install($1, $2); + $$ = NULL; + yyerrok; + } + ; + +func_name + : NAME + { $$ = $1; } + | FUNC_CALL + { $$ = $1; } + | lex_builtin + { + yyerror("%s() is a built-in function, it cannot be redefined", + tokstart); + errcount++; + /* yyerrok; */ + } + ; + +lex_builtin + : LEX_BUILTIN + | LEX_LENGTH + ; + +function_prologue + : LEX_FUNCTION + { + param_counter = 0; + } + func_name '(' opt_param_list r_paren opt_nls + { + $$ = append_right(make_param($3), $5); + can_return = 1; + } + ; + +function_body + : l_brace statements r_brace opt_semi + { + $$ = $2; + can_return = 0; + } + ; + + +pattern + : exp + { $$ = $1; } + | exp comma exp + { $$ = mkrangenode ( node($1, Node_cond_pair, $3) ); } + ; + +regexp + /* + * In this rule, want_regexp tells yylex that the next thing + * is a regexp so it should read up to the closing slash. + */ + : '/' + { ++want_regexp; } + REGEXP '/' + { + NODE *n; + int len; + + getnode(n); + n->type = Node_regex; + len = strlen($3); + n->re_exp = make_string($3, len); + n->re_reg = make_regexp($3, len, 0, 1); + n->re_text = NULL; + n->re_flags = CONST; + n->re_cnt = 1; + $$ = n; + } + ; + +action + : l_brace statements r_brace opt_semi opt_nls + { $$ = $2 ; } + | l_brace r_brace opt_semi opt_nls + { $$ = NULL; } + ; + +statements + : statement + { $$ = $1; } + | statements statement + { + if ($1 == NULL || $1->type != Node_statement_list) + $1 = node($1, Node_statement_list,(NODE *)NULL); + $$ = append_right($1, + node( $2, Node_statement_list, (NODE *)NULL)); + yyerrok; + } + | error + { $$ = NULL; } + | statements error + { $$ = NULL; } + ; + +statement_term + : nls + | semi opt_nls + ; + +statement + : semi opt_nls + { $$ = NULL; } + | l_brace r_brace + { $$ = NULL; } + | l_brace statements r_brace + { $$ = $2; } + | if_statement + { $$ = $1; } + | LEX_WHILE '(' exp r_paren opt_nls statement + { $$ = node ($3, Node_K_while, $6); } + | LEX_DO opt_nls statement LEX_WHILE '(' exp r_paren 
opt_nls + { $$ = node ($6, Node_K_do, $3); } + | LEX_FOR '(' NAME LEX_IN NAME r_paren opt_nls statement + { + $$ = node ($8, Node_K_arrayfor, make_for_loop(variable($3,1), + (NODE *)NULL, variable($5,1))); + } + | LEX_FOR '(' opt_exp semi exp semi opt_exp r_paren opt_nls statement + { + $$ = node($10, Node_K_for, (NODE *)make_for_loop($3, $5, $7)); + } + | LEX_FOR '(' opt_exp semi semi opt_exp r_paren opt_nls statement + { + $$ = node ($9, Node_K_for, + (NODE *)make_for_loop($3, (NODE *)NULL, $6)); + } + | LEX_BREAK statement_term + /* for break, maybe we'll have to remember where to break to */ + { $$ = node ((NODE *)NULL, Node_K_break, (NODE *)NULL); } + | LEX_CONTINUE statement_term + /* similarly */ + { $$ = node ((NODE *)NULL, Node_K_continue, (NODE *)NULL); } + | print '(' expression_list r_paren output_redir statement_term + { $$ = node ($3, $1, $5); } + | print opt_rexpression_list output_redir statement_term + { + if ($1 == Node_K_print && $2 == NULL) + $2 = node(node(make_number(0.0), + Node_field_spec, + (NODE *) NULL), + Node_expression_list, + (NODE *) NULL); + + $$ = node ($2, $1, $3); + } + | LEX_NEXT opt_exp statement_term + { NODETYPE type; + + if ($2 && $2 == lookup("file")) { + if (do_lint) + warning("`next file' is a gawk extension"); + else if (do_unix || do_posix) + yyerror("`next file' is a gawk extension"); + else if (! io_allowed) + yyerror("`next file' used in BEGIN or END action"); + type = Node_K_nextfile; + } else { + if (! io_allowed) + yyerror("next used in BEGIN or END action"); + type = Node_K_next; + } + $$ = node ((NODE *)NULL, type, (NODE *)NULL); + } + | LEX_EXIT opt_exp statement_term + { $$ = node ($2, Node_K_exit, (NODE *)NULL); } + | LEX_RETURN + { if (! 
can_return) yyerror("return used outside function context"); } + opt_exp statement_term + { $$ = node ($3, Node_K_return, (NODE *)NULL); } + | LEX_DELETE NAME '[' expression_list ']' statement_term + { $$ = node (variable($2,1), Node_K_delete, $4); } + | exp statement_term + { $$ = $1; } + ; + +print + : LEX_PRINT + { $$ = $1; } + | LEX_PRINTF + { $$ = $1; } + ; + +if_statement + : LEX_IF '(' exp r_paren opt_nls statement + { + $$ = node($3, Node_K_if, + node($6, Node_if_branches, (NODE *)NULL)); + } + | LEX_IF '(' exp r_paren opt_nls statement + LEX_ELSE opt_nls statement + { $$ = node ($3, Node_K_if, + node ($6, Node_if_branches, $9)); } + ; + +nls + : NEWLINE + { want_assign = 0; } + | nls NEWLINE + ; + +opt_nls + : /* empty */ + | nls + ; + +input_redir + : /* empty */ + { $$ = NULL; } + | '<' simp_exp + { $$ = node ($2, Node_redirect_input, (NODE *)NULL); } + ; + +output_redir + : /* empty */ + { $$ = NULL; } + | '>' exp + { $$ = node ($2, Node_redirect_output, (NODE *)NULL); } + | APPEND_OP exp + { $$ = node ($2, Node_redirect_append, (NODE *)NULL); } + | '|' exp + { $$ = node ($2, Node_redirect_pipe, (NODE *)NULL); } + ; + +opt_param_list + : /* empty */ + { $$ = NULL; } + | param_list + { $$ = $1; } + ; + +param_list + : NAME + { $$ = make_param($1); } + | param_list comma NAME + { $$ = append_right($1, make_param($3)); yyerrok; } + | error + { $$ = NULL; } + | param_list error + { $$ = NULL; } + | param_list comma error + { $$ = NULL; } + ; + +/* optional expression, as in for loop */ +opt_exp + : /* empty */ + { $$ = NULL; } + | exp + { $$ = $1; } + ; + +opt_rexpression_list + : /* empty */ + { $$ = NULL; } + | rexpression_list + { $$ = $1; } + ; + +rexpression_list + : rexp + { $$ = node ($1, Node_expression_list, (NODE *)NULL); } + | rexpression_list comma rexp + { + $$ = append_right($1, + node( $3, Node_expression_list, (NODE *)NULL)); + yyerrok; + } + | error + { $$ = NULL; } + | rexpression_list error + { $$ = NULL; } + | rexpression_list error rexp 
+ { $$ = NULL; } + | rexpression_list comma error + { $$ = NULL; } + ; + +opt_expression_list + : /* empty */ + { $$ = NULL; } + | expression_list + { $$ = $1; } + ; + +expression_list + : exp + { $$ = node ($1, Node_expression_list, (NODE *)NULL); } + | expression_list comma exp + { + $$ = append_right($1, + node( $3, Node_expression_list, (NODE *)NULL)); + yyerrok; + } + | error + { $$ = NULL; } + | expression_list error + { $$ = NULL; } + | expression_list error exp + { $$ = NULL; } + | expression_list comma error + { $$ = NULL; } + ; + +/* Expressions, not including the comma operator. */ +exp : variable ASSIGNOP + { want_assign = 0; } + exp + { + if (do_lint && $4->type == Node_regex) + warning("Regular expression on left of assignment."); + $$ = node ($1, $2, $4); + } + | '(' expression_list r_paren LEX_IN NAME + { $$ = node (variable($5,1), Node_in_array, $2); } + | exp '|' LEX_GETLINE opt_variable + { + $$ = node ($4, Node_K_getline, + node ($1, Node_redirect_pipein, (NODE *)NULL)); + } + | LEX_GETLINE opt_variable input_redir + { + if (do_lint && ! io_allowed && $3 == NULL) + warning("non-redirected getline undefined inside BEGIN or END action"); + $$ = node ($2, Node_K_getline, $3); + } + | exp LEX_AND exp + { $$ = node ($1, Node_and, $3); } + | exp LEX_OR exp + { $$ = node ($1, Node_or, $3); } + | exp MATCHOP exp + { + if ($1->type == Node_regex) + warning("Regular expression on left of MATCH operator."); + $$ = node ($1, $2, mk_rexp($3)); + } + | regexp + { $$ = $1; } + | '!' regexp %prec UNARY + { + $$ = node(node(make_number(0.0), + Node_field_spec, + (NODE *) NULL), + Node_nomatch, + $2); + } + | exp LEX_IN NAME + { $$ = node (variable($3,1), Node_in_array, $1); } + | exp RELOP exp + { + if (do_lint && $3->type == Node_regex) + warning("Regular expression on left of comparison."); + $$ = node ($1, $2, $3); + } + | exp '<' exp + { $$ = node ($1, Node_less, $3); } + | exp '>' exp + { $$ = node ($1, Node_greater, $3); } + | exp '?' 
exp ':' exp + { $$ = node($1, Node_cond_exp, node($3, Node_if_branches, $5));} + | simp_exp + { $$ = $1; } + | exp simp_exp %prec CONCAT_OP + { $$ = node ($1, Node_concat, $2); } + ; + +rexp + : variable ASSIGNOP + { want_assign = 0; } + rexp + { $$ = node ($1, $2, $4); } + | rexp LEX_AND rexp + { $$ = node ($1, Node_and, $3); } + | rexp LEX_OR rexp + { $$ = node ($1, Node_or, $3); } + | LEX_GETLINE opt_variable input_redir + { + if (do_lint && ! io_allowed && $3 == NULL) + warning("non-redirected getline undefined inside BEGIN or END action"); + $$ = node ($2, Node_K_getline, $3); + } + | regexp + { $$ = $1; } + | '!' regexp %prec UNARY + { $$ = node((NODE *) NULL, Node_nomatch, $2); } + | rexp MATCHOP rexp + { $$ = node ($1, $2, mk_rexp($3)); } + | rexp LEX_IN NAME + { $$ = node (variable($3,1), Node_in_array, $1); } + | rexp RELOP rexp + { $$ = node ($1, $2, $3); } + | rexp '?' rexp ':' rexp + { $$ = node($1, Node_cond_exp, node($3, Node_if_branches, $5));} + | simp_exp + { $$ = $1; } + | rexp simp_exp %prec CONCAT_OP + { $$ = node ($1, Node_concat, $2); } + ; + +simp_exp + : non_post_simp_exp + /* Binary operators in order of decreasing precedence. */ + | simp_exp '^' simp_exp + { $$ = node ($1, Node_exp, $3); } + | simp_exp '*' simp_exp + { $$ = node ($1, Node_times, $3); } + | simp_exp '/' simp_exp + { $$ = node ($1, Node_quotient, $3); } + | simp_exp '%' simp_exp + { $$ = node ($1, Node_mod, $3); } + | simp_exp '+' simp_exp + { $$ = node ($1, Node_plus, $3); } + | simp_exp '-' simp_exp + { $$ = node ($1, Node_minus, $3); } + | variable INCREMENT + { $$ = node ($1, Node_postincrement, (NODE *)NULL); } + | variable DECREMENT + { $$ = node ($1, Node_postdecrement, (NODE *)NULL); } + ; + +non_post_simp_exp + : '!' 
simp_exp %prec UNARY + { $$ = node ($2, Node_not,(NODE *) NULL); } + | '(' exp r_paren + { $$ = $2; } + | LEX_BUILTIN + '(' opt_expression_list r_paren + { $$ = snode ($3, Node_builtin, (int) $1); } + | LEX_LENGTH '(' opt_expression_list r_paren + { $$ = snode ($3, Node_builtin, (int) $1); } + | LEX_LENGTH + { + if (do_lint) + warning("call of `length' without parentheses is not portable"); + $$ = snode ((NODE *)NULL, Node_builtin, (int) $1); + if (do_posix) + warning( "call of `length' without parentheses is deprecated by POSIX"); + } + | FUNC_CALL '(' opt_expression_list r_paren + { + $$ = node ($3, Node_func_call, make_string($1, strlen($1))); + } + | variable + | INCREMENT variable + { $$ = node ($2, Node_preincrement, (NODE *)NULL); } + | DECREMENT variable + { $$ = node ($2, Node_predecrement, (NODE *)NULL); } + | YNUMBER + { $$ = $1; } + | YSTRING + { $$ = $1; } + + | '-' simp_exp %prec UNARY + { if ($2->type == Node_val) { + $2->numbr = -(force_number($2)); + $$ = $2; + } else + $$ = node ($2, Node_unary_minus, (NODE *)NULL); + } + | '+' simp_exp %prec UNARY + { $$ = $2; } + ; + +opt_variable + : /* empty */ + { $$ = NULL; } + | variable + { $$ = $1; } + ; + +variable + : NAME + { $$ = variable($1,1); } + | NAME '[' expression_list ']' + { + if ($3->rnode == NULL) { + $$ = node (variable($1,1), Node_subscript, $3->lnode); + freenode($3); + } else + $$ = node (variable($1,1), Node_subscript, $3); + } + | '$' non_post_simp_exp + { $$ = node ($2, Node_field_spec, (NODE *)NULL); } + ; + +l_brace + : '{' opt_nls + ; + +r_brace + : '}' opt_nls { yyerrok; } + ; + +r_paren + : ')' { yyerrok; } + ; + +opt_semi + : /* empty */ + | semi + ; + +semi + : ';' { yyerrok; want_assign = 0; } + ; + +comma : ',' opt_nls { yyerrok; } + ; + +%% + +struct token { + char *operator; /* text to match */ + NODETYPE value; /* node type */ + int class; /* lexical class */ + unsigned flags; /* # of args. 
allowed and compatibility */ +# define ARGS 0xFF /* 0, 1, 2, 3 args allowed (any combination) */ +# define A(n) (1<<(n)) +# define VERSION 0xFF00 /* old awk is zero */ +# define NOT_OLD 0x0100 /* feature not in old awk */ +# define NOT_POSIX 0x0200 /* feature not in POSIX */ +# define GAWKX 0x0400 /* gawk extension */ + NODE *(*ptr) (); /* function that implements this keyword */ +}; + +extern NODE + *do_exp(), *do_getline(), *do_index(), *do_length(), + *do_sqrt(), *do_log(), *do_sprintf(), *do_substr(), + *do_split(), *do_system(), *do_int(), *do_close(), + *do_atan2(), *do_sin(), *do_cos(), *do_rand(), + *do_srand(), *do_match(), *do_tolower(), *do_toupper(), + *do_sub(), *do_gsub(), *do_strftime(), *do_systime(); + +/* Tokentab is sorted in ASCII ascending order, so it can be binary searched. */ + +static struct token tokentab[] = { +{"BEGIN", Node_illegal, LEX_BEGIN, 0, 0}, +{"END", Node_illegal, LEX_END, 0, 0}, +{"atan2", Node_builtin, LEX_BUILTIN, NOT_OLD|A(2), do_atan2}, +{"break", Node_K_break, LEX_BREAK, 0, 0}, +{"close", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_close}, +{"continue", Node_K_continue, LEX_CONTINUE, 0, 0}, +{"cos", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_cos}, +{"delete", Node_K_delete, LEX_DELETE, NOT_OLD, 0}, +{"do", Node_K_do, LEX_DO, NOT_OLD, 0}, +{"else", Node_illegal, LEX_ELSE, 0, 0}, +{"exit", Node_K_exit, LEX_EXIT, 0, 0}, +{"exp", Node_builtin, LEX_BUILTIN, A(1), do_exp}, +{"for", Node_K_for, LEX_FOR, 0, 0}, +{"func", Node_K_function, LEX_FUNCTION, NOT_POSIX|NOT_OLD, 0}, +{"function", Node_K_function, LEX_FUNCTION, NOT_OLD, 0}, +{"getline", Node_K_getline, LEX_GETLINE, NOT_OLD, 0}, +{"gsub", Node_builtin, LEX_BUILTIN, NOT_OLD|A(2)|A(3), do_gsub}, +{"if", Node_K_if, LEX_IF, 0, 0}, +{"in", Node_illegal, LEX_IN, 0, 0}, +{"index", Node_builtin, LEX_BUILTIN, A(2), do_index}, +{"int", Node_builtin, LEX_BUILTIN, A(1), do_int}, +{"length", Node_builtin, LEX_LENGTH, A(0)|A(1), do_length}, +{"log", Node_builtin, LEX_BUILTIN, A(1), do_log},
+{"match", Node_builtin, LEX_BUILTIN, NOT_OLD|A(2), do_match}, +{"next", Node_K_next, LEX_NEXT, 0, 0}, +{"print", Node_K_print, LEX_PRINT, 0, 0}, +{"printf", Node_K_printf, LEX_PRINTF, 0, 0}, +{"rand", Node_builtin, LEX_BUILTIN, NOT_OLD|A(0), do_rand}, +{"return", Node_K_return, LEX_RETURN, NOT_OLD, 0}, +{"sin", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_sin}, +{"split", Node_builtin, LEX_BUILTIN, A(2)|A(3), do_split}, +{"sprintf", Node_builtin, LEX_BUILTIN, 0, do_sprintf}, +{"sqrt", Node_builtin, LEX_BUILTIN, A(1), do_sqrt}, +{"srand", Node_builtin, LEX_BUILTIN, NOT_OLD|A(0)|A(1), do_srand}, +{"strftime", Node_builtin, LEX_BUILTIN, GAWKX|A(1)|A(2), do_strftime}, +{"sub", Node_builtin, LEX_BUILTIN, NOT_OLD|A(2)|A(3), do_sub}, +{"substr", Node_builtin, LEX_BUILTIN, A(2)|A(3), do_substr}, +{"system", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_system}, +{"systime", Node_builtin, LEX_BUILTIN, GAWKX|A(0), do_systime}, +{"tolower", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_tolower}, +{"toupper", Node_builtin, LEX_BUILTIN, NOT_OLD|A(1), do_toupper}, +{"while", Node_K_while, LEX_WHILE, 0, 0}, +}; + +/* VARARGS0 */ +static void +yyerror(va_alist) +va_dcl +{ + va_list args; + char *mesg = NULL; + register char *bp, *cp; + char *scan; + char buf[120]; + + errcount++; + /* Find the current line in the input file */ + if (lexptr) { + if (!thisline) { + cp = lexeme; + if (*cp == '\n') { + cp--; + mesg = "unexpected newline"; + } + for ( ; cp != lexptr_begin && *cp != '\n'; --cp) + ; + if (*cp == '\n') + cp++; + thisline = cp; + } + /* NL isn't guaranteed */ + bp = lexeme; + while (bp < lexend && *bp && *bp != '\n') + bp++; + } else { + thisline = "(END OF FILE)"; + bp = thisline + 13; + } + msg("%.*s", (int) (bp - thisline), thisline); + bp = buf; + cp = buf + sizeof(buf) - 24; /* 24 more than longest msg. 
input */ + if (lexptr) { + scan = thisline; + while (bp < cp && scan < lexeme) + if (*scan++ == '\t') + *bp++ = '\t'; + else + *bp++ = ' '; + *bp++ = '^'; + *bp++ = ' '; + } + va_start(args); + if (mesg == NULL) + mesg = va_arg(args, char *); + strcpy(bp, mesg); + err("", buf, args); + va_end(args); + exit(2); +} + +static char * +get_src_buf() +{ + static int samefile = 0; + static int nextfile = 0; + static char *buf = NULL; + static int fd; + int n; + register char *scan; + static int len = 0; + static int did_newline = 0; +# define SLOP 128 /* enough space to hold most source lines */ + + if (nextfile > numfiles) + return NULL; + + if (srcfiles[nextfile].stype == CMDLINE) { + if (len == 0) { + len = strlen(srcfiles[nextfile].val); + sourceline = 1; + lexptr = lexptr_begin = srcfiles[nextfile].val; + lexend = lexptr + len; + } else if (!did_newline && *(lexptr-1) != '\n') { + /* + * The following goop is to ensure that the source + * ends with a newline and that the entire current + * line is available for error messages. 
+ */ + int offset; + + did_newline = 1; + offset = lexptr - lexeme; + for (scan = lexeme; scan > lexptr_begin; scan--) + if (*scan == '\n') { + scan++; + break; + } + len = lexptr - scan; + emalloc(buf, char *, len+1, "get_src_buf"); + memcpy(buf, scan, len); + thisline = buf; + lexptr = buf + len; + *lexptr = '\n'; + lexeme = lexptr - offset; + lexptr_begin = buf; + lexend = lexptr + 1; + } else { + len = 0; + lexeme = lexptr = lexptr_begin = NULL; + } + if (lexptr == NULL && ++nextfile <= numfiles) + return get_src_buf(); + return lexptr; + } + if (!samefile) { + source = srcfiles[nextfile].val; + if (source == NULL) { + if (buf) { + free(buf); + buf = NULL; + } + len = 0; + return lexeme = lexptr = lexptr_begin = NULL; + } + fd = pathopen(source); + if (fd == -1) + fatal("can't open source file \"%s\" for reading (%s)", + source, strerror(errno)); + len = optimal_bufsize(fd); + if (buf) + free(buf); + emalloc(buf, char *, len + SLOP, "get_src_buf"); + lexptr_begin = buf + SLOP; + samefile = 1; + sourceline = 1; + } else { + /* + * Here, we retain the current source line (up to length SLOP) + * in the beginning of the buffer that was overallocated above + */ + int offset; + int linelen; + + offset = lexptr - lexeme; + for (scan = lexeme; scan > lexptr_begin; scan--) + if (*scan == '\n') { + scan++; + break; + } + linelen = lexptr - scan; + if (linelen > SLOP) + linelen = SLOP; + thisline = buf + SLOP - linelen; + memcpy(thisline, scan, linelen); + lexeme = buf + SLOP - offset; + lexptr_begin = thisline; + } + n = read(fd, buf + SLOP, len); + if (n == -1) + fatal("can't read sourcefile \"%s\" (%s)", + source, strerror(errno)); + if (n == 0) { + samefile = 0; + nextfile++; + len = 0; + return get_src_buf(); + } + lexptr = buf + SLOP; + lexend = lexptr + n; + return buf; +} + +#define tokadd(x) (*token++ = (x), token == tokend ? 
tokexpand() : token) + +char * +tokexpand() +{ + static int toksize = 60; + int tokoffset; + + tokoffset = token - tokstart; + toksize *= 2; + if (tokstart) + erealloc(tokstart, char *, toksize, "tokexpand"); + else + emalloc(tokstart, char *, toksize, "tokexpand"); + tokend = tokstart + toksize; + token = tokstart + tokoffset; + return token; +} + +#if DEBUG +char +nextc() { + if (lexptr && lexptr < lexend) + return *lexptr++; + else if (get_src_buf()) + return *lexptr++; + else + return '\0'; +} +#else +#define nextc() ((lexptr && lexptr < lexend) ? \ + *lexptr++ : \ + (get_src_buf() ? *lexptr++ : '\0') \ + ) +#endif +#define pushback() (lexptr && lexptr > lexptr_begin ? lexptr-- : lexptr) + +/* + * Read the input and turn it into tokens. + */ + +static int +yylex() +{ + register int c; + int seen_e = 0; /* These are for numbers */ + int seen_point = 0; + int esc_seen; /* for literal strings */ + int low, mid, high; + static int did_newline = 0; + char *tokkey; + + if (!nextc()) + return 0; + pushback(); + lexeme = lexptr; + thisline = NULL; + if (want_regexp) { + int in_brack = 0; + + want_regexp = 0; + token = tokstart; + while ((c = nextc()) != 0) { + switch (c) { + case '[': + in_brack = 1; + break; + case ']': + in_brack = 0; + break; + case '\\': + if ((c = nextc()) == '\0') { + yyerror("unterminated regexp ends with \\ at end of file"); + } else if (c == '\n') { + sourceline++; + continue; + } else + tokadd('\\'); + break; + case '/': /* end of the regexp */ + if (in_brack) + break; + + pushback(); + tokadd('\0'); + yylval.sval = tokstart; + return REGEXP; + case '\n': + pushback(); + yyerror("unterminated regexp"); + case '\0': + yyerror("unterminated regexp at end of file"); + } + tokadd(c); + } + } +retry: + while ((c = nextc()) == ' ' || c == '\t') + ; + + lexeme = lexptr ? 
lexptr - 1 : lexptr; + thisline = NULL; + token = tokstart; + yylval.nodetypeval = Node_illegal; + + switch (c) { + case 0: + return 0; + + case '\n': + sourceline++; + return NEWLINE; + + case '#': /* it's a comment */ + while ((c = nextc()) != '\n') { + if (c == '\0') + return 0; + } + sourceline++; + return NEWLINE; + + case '\\': +#ifdef RELAXED_CONTINUATION + if (!do_unix) { /* strip trailing white-space and/or comment */ + while ((c = nextc()) == ' ' || c == '\t') continue; + if (c == '#') + while ((c = nextc()) != '\n') if (!c) break; + pushback(); + } +#endif /*RELAXED_CONTINUATION*/ + if (nextc() == '\n') { + sourceline++; + goto retry; + } else + yyerror("inappropriate use of backslash"); + break; + + case '$': + want_assign = 1; + return '$'; + + case ')': + case ']': + case '(': + case '[': + case ';': + case ':': + case '?': + case '{': + case ',': + return c; + + case '*': + if ((c = nextc()) == '=') { + yylval.nodetypeval = Node_assign_times; + return ASSIGNOP; + } else if (do_posix) { + pushback(); + return '*'; + } else if (c == '*') { + /* make ** and **= aliases for ^ and ^= */ + static int did_warn_op = 0, did_warn_assgn = 0; + + if (nextc() == '=') { + if (do_lint && ! did_warn_assgn) { + did_warn_assgn = 1; + warning("**= is not allowed by POSIX"); + } + yylval.nodetypeval = Node_assign_exp; + return ASSIGNOP; + } else { + pushback(); + if (do_lint && ! did_warn_op) { + did_warn_op = 1; + warning("** is not allowed by POSIX"); + } + return '^'; + } + } + pushback(); + return '*'; + + case '/': + if (want_assign) { + if (nextc() == '=') { + yylval.nodetypeval = Node_assign_quotient; + return ASSIGNOP; + } + pushback(); + } + return '/'; + + case '%': + if (nextc() == '=') { + yylval.nodetypeval = Node_assign_mod; + return ASSIGNOP; + } + pushback(); + return '%'; + + case '^': + { + static int did_warn_op = 0, did_warn_assgn = 0; + + if (nextc() == '=') { + + if (do_lint && ! 
did_warn_assgn) { + did_warn_assgn = 1; + warning("operator `^=' is not supported in old awk"); + } + yylval.nodetypeval = Node_assign_exp; + return ASSIGNOP; + } + pushback(); + if (do_lint && ! did_warn_op) { + did_warn_op = 1; + warning("operator `^' is not supported in old awk"); + } + return '^'; + } + + case '+': + if ((c = nextc()) == '=') { + yylval.nodetypeval = Node_assign_plus; + return ASSIGNOP; + } + if (c == '+') + return INCREMENT; + pushback(); + return '+'; + + case '!': + if ((c = nextc()) == '=') { + yylval.nodetypeval = Node_notequal; + return RELOP; + } + if (c == '~') { + yylval.nodetypeval = Node_nomatch; + want_assign = 0; + return MATCHOP; + } + pushback(); + return '!'; + + case '<': + if (nextc() == '=') { + yylval.nodetypeval = Node_leq; + return RELOP; + } + yylval.nodetypeval = Node_less; + pushback(); + return '<'; + + case '=': + if (nextc() == '=') { + yylval.nodetypeval = Node_equal; + return RELOP; + } + yylval.nodetypeval = Node_assign; + pushback(); + return ASSIGNOP; + + case '>': + if ((c = nextc()) == '=') { + yylval.nodetypeval = Node_geq; + return RELOP; + } else if (c == '>') { + yylval.nodetypeval = Node_redirect_append; + return APPEND_OP; + } + yylval.nodetypeval = Node_greater; + pushback(); + return '>'; + + case '~': + yylval.nodetypeval = Node_match; + want_assign = 0; + return MATCHOP; + + case '}': + /* + * Added did newline stuff. Easier than + * hacking the grammar + */ + if (did_newline) { + did_newline = 0; + return c; + } + did_newline++; + --lexptr; /* pick up } next time */ + return NEWLINE; + + case '"': + esc_seen = 0; + while ((c = nextc()) != '"') { + if (c == '\n') { + pushback(); + yyerror("unterminated string"); + } + if (c == '\\') { + c = nextc(); + if (c == '\n') { + sourceline++; + continue; + } + esc_seen = 1; + tokadd('\\'); + } + if (c == '\0') { + pushback(); + yyerror("unterminated string"); + } + tokadd(c); + } + yylval.nodeval = make_str_node(tokstart, + token - tokstart, esc_seen ? 
SCAN : 0); + yylval.nodeval->flags |= PERM; + return YSTRING; + + case '-': + if ((c = nextc()) == '=') { + yylval.nodetypeval = Node_assign_minus; + return ASSIGNOP; + } + if (c == '-') + return DECREMENT; + pushback(); + return '-'; + + case '.': + c = nextc(); + pushback(); + if (!isdigit(c)) + return '.'; + else + c = '.'; /* FALL THROUGH */ + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + /* It's a number */ + for (;;) { + int gotnumber = 0; + + tokadd(c); + switch (c) { + case '.': + if (seen_point) { + gotnumber++; + break; + } + ++seen_point; + break; + case 'e': + case 'E': + if (seen_e) { + gotnumber++; + break; + } + ++seen_e; + if ((c = nextc()) == '-' || c == '+') + tokadd(c); + else + pushback(); + break; + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + break; + default: + gotnumber++; + } + if (gotnumber) + break; + c = nextc(); + } + pushback(); + yylval.nodeval = make_number(atof(tokstart)); + yylval.nodeval->flags |= PERM; + return YNUMBER; + + case '&': + if ((c = nextc()) == '&') { + yylval.nodetypeval = Node_and; + for (;;) { + c = nextc(); + if (c == '\0') + break; + if (c == '#') { + while ((c = nextc()) != '\n' && c != '\0') + ; + if (c == '\0') + break; + } + if (c == '\n') + sourceline++; + if (! isspace(c)) { + pushback(); + break; + } + } + want_assign = 0; + return LEX_AND; + } + pushback(); + return '&'; + + case '|': + if ((c = nextc()) == '|') { + yylval.nodetypeval = Node_or; + for (;;) { + c = nextc(); + if (c == '\0') + break; + if (c == '#') { + while ((c = nextc()) != '\n' && c != '\0') + ; + if (c == '\0') + break; + } + if (c == '\n') + sourceline++; + if (! isspace(c)) { + pushback(); + break; + } + } + want_assign = 0; + return LEX_OR; + } + pushback(); + return '|'; + } + + if (c != '_' && ! 
isalpha(c)) + yyerror("Invalid char '%c' in expression\n", c); + + /* it's some type of name-type-thing. Find its length */ + token = tokstart; + while (is_identchar(c)) { + tokadd(c); + c = nextc(); + } + tokadd('\0'); + emalloc(tokkey, char *, token - tokstart, "yylex"); + memcpy(tokkey, tokstart, token - tokstart); + pushback(); + + /* See if it is a special token. */ + low = 0; + high = (sizeof (tokentab) / sizeof (tokentab[0])) - 1; + while (low <= high) { + int i/* , c */; + + mid = (low + high) / 2; + c = *tokstart - tokentab[mid].operator[0]; + i = c ? c : strcmp (tokstart, tokentab[mid].operator); + + if (i < 0) { /* token < mid */ + high = mid - 1; + } else if (i > 0) { /* token > mid */ + low = mid + 1; + } else { + if (do_lint) { + if (tokentab[mid].flags & GAWKX) + warning("%s() is a gawk extension", + tokentab[mid].operator); + if (tokentab[mid].flags & NOT_POSIX) + warning("POSIX does not allow %s", + tokentab[mid].operator); + if (tokentab[mid].flags & NOT_OLD) + warning("%s is not supported in old awk", + tokentab[mid].operator); + } + if ((do_unix && (tokentab[mid].flags & GAWKX)) + || (do_posix && (tokentab[mid].flags & NOT_POSIX))) + break; + if (tokentab[mid].class == LEX_BUILTIN + || tokentab[mid].class == LEX_LENGTH + ) + yylval.lval = mid; + else + yylval.nodetypeval = tokentab[mid].value; + + return tokentab[mid].class; + } + } + + yylval.sval = tokkey; + if (*lexptr == '(') + return FUNC_CALL; + else { + want_assign = 1; + return NAME; + } +} + +static NODE * +node_common(op) +NODETYPE op; +{ + register NODE *r; + + getnode(r); + r->type = op; + r->flags = MALLOC; + /* if lookahead is NL, lineno is 1 too high */ + if (lexeme && *lexeme == '\n') + r->source_line = sourceline - 1; + else + r->source_line = sourceline; + r->source_file = source; + return r; +} + +/* + * This allocates a node with defined lnode and rnode. 
+ */ +NODE * +node(left, op, right) +NODE *left, *right; +NODETYPE op; +{ + register NODE *r; + + r = node_common(op); + r->lnode = left; + r->rnode = right; + return r; +} + +/* + * This allocates a node with defined subnode and proc for builtin functions + * Checks for arg. count and supplies defaults where possible. + */ +static NODE * +snode(subn, op, idx) +NODETYPE op; +int idx; +NODE *subn; +{ + register NODE *r; + register NODE *n; + int nexp = 0; + int args_allowed; + + r = node_common(op); + + /* traverse expression list to see how many args. given */ + for (n= subn; n; n= n->rnode) { + nexp++; + if (nexp > 3) + break; + } + + /* check against how many args. are allowed for this builtin */ + args_allowed = tokentab[idx].flags & ARGS; + if (args_allowed && !(args_allowed & A(nexp))) + fatal("%s() cannot have %d argument%c", + tokentab[idx].operator, nexp, nexp == 1 ? ' ' : 's'); + + r->proc = tokentab[idx].ptr; + + /* special case processing for a few builtins */ + if (nexp == 0 && r->proc == do_length) { + subn = node(node(make_number(0.0),Node_field_spec,(NODE *)NULL), + Node_expression_list, + (NODE *) NULL); + } else if (r->proc == do_match) { + if (subn->rnode->lnode->type != Node_regex) + subn->rnode->lnode = mk_rexp(subn->rnode->lnode); + } else if (r->proc == do_sub || r->proc == do_gsub) { + if (subn->lnode->type != Node_regex) + subn->lnode = mk_rexp(subn->lnode); + if (nexp == 2) + append_right(subn, node(node(make_number(0.0), + Node_field_spec, + (NODE *) NULL), + Node_expression_list, + (NODE *) NULL)); + else if (do_lint && subn->rnode->rnode->lnode->type == Node_val) + warning("string literal as last arg of substitute"); + } else if (r->proc == do_split) { + if (nexp == 2) + append_right(subn, + node(FS_node, Node_expression_list, (NODE *) NULL)); + n = subn->rnode->rnode->lnode; + if (n->type != Node_regex) + subn->rnode->rnode->lnode = mk_rexp(n); + if (nexp == 2) + subn->rnode->rnode->lnode->re_flags |= FS_DFLT; + } + + r->subnode = subn; 
+ return r; +} + +/* + * This allocates a Node_line_range node with defined condpair and + * zeroes the trigger word to avoid the temptation of assuming that calling + * 'node( foo, Node_line_range, 0)' will properly initialize 'triggered'. + */ +/* Otherwise like node() */ +static NODE * +mkrangenode(cpair) +NODE *cpair; +{ + register NODE *r; + + getnode(r); + r->type = Node_line_range; + r->condpair = cpair; + r->triggered = 0; + return r; +} + +/* Build a for loop */ +static NODE * +make_for_loop(init, cond, incr) +NODE *init, *cond, *incr; +{ + register FOR_LOOP_HEADER *r; + NODE *n; + + emalloc(r, FOR_LOOP_HEADER *, sizeof(FOR_LOOP_HEADER), "make_for_loop"); + getnode(n); + n->type = Node_illegal; + r->init = init; + r->cond = cond; + r->incr = incr; + n->sub.nodep.r.hd = r; + return n; +} + +/* + * Install a name in the symbol table, even if it is already there. + * Caller must check against redefinition if that is desired. + */ +NODE * +install(name, value) +char *name; +NODE *value; +{ + register NODE *hp; + register int len, bucket; + + len = strlen(name); + bucket = hash(name, len); + getnode(hp); + hp->type = Node_hashnode; + hp->hnext = variables[bucket]; + variables[bucket] = hp; + hp->hlength = len; + hp->hvalue = value; + hp->hname = name; + hp->hvalue->vname = name; + return hp->hvalue; +} + +/* find the most recent hash node for name installed by install */ +NODE * +lookup(name) +char *name; +{ + register NODE *bucket; + register int len; + + len = strlen(name); + bucket = variables[hash(name, len)]; + while (bucket) { + if (bucket->hlength == len && STREQN(bucket->hname, name, len)) + return bucket->hvalue; + bucket = bucket->hnext; + } + return NULL; +} + +/* + * Add new to the rightmost branch of LIST. This uses n^2 time, so we make + * a simple attempt at optimizing it. 
+ */ +static NODE * +append_right(list, new) +NODE *list, *new; +{ + register NODE *oldlist; + static NODE *savefront = NULL, *savetail = NULL; + + oldlist = list; + if (savefront == oldlist) { + savetail = savetail->rnode = new; + return oldlist; + } else + savefront = oldlist; + while (list->rnode != NULL) + list = list->rnode; + savetail = list->rnode = new; + return oldlist; +} + +/* + * check if name is already installed; if so, it had better have Null value, + * in which case def is added as the value. Otherwise, install name with def + * as value. + */ +static void +func_install(params, def) +NODE *params; +NODE *def; +{ + NODE *r; + + pop_params(params->rnode); + pop_var(params, 0); + r = lookup(params->param); + if (r != NULL) { + fatal("function name `%s' previously defined", params->param); + } else + (void) install(params->param, node(params, Node_func, def)); +} + +static void +pop_var(np, freeit) +NODE *np; +int freeit; +{ + register NODE *bucket, **save; + register int len; + char *name; + + name = np->param; + len = strlen(name); + save = &(variables[hash(name, len)]); + for (bucket = *save; bucket; bucket = bucket->hnext) { + if (len == bucket->hlength && STREQN(bucket->hname, name, len)) { + *save = bucket->hnext; + freenode(bucket); + if (freeit) + free(np->param); + return; + } + save = &(bucket->hnext); + } +} + +static void +pop_params(params) +NODE *params; +{ + register NODE *np; + + for (np = params; np != NULL; np = np->rnode) + pop_var(np, 1); +} + +static NODE * +make_param(name) +char *name; +{ + NODE *r; + + getnode(r); + r->type = Node_param_list; + r->rnode = NULL; + r->param = name; + r->param_cnt = param_counter++; + return (install(name, r)); +} + +/* Name points to a variable name. 
Make sure it's in the symbol table */ +NODE * +variable(name, can_free) +char *name; +int can_free; +{ + register NODE *r; + static int env_loaded = 0; + + if (!env_loaded && STREQ(name, "ENVIRON")) { + load_environ(); + env_loaded = 1; + } + if ((r = lookup(name)) == NULL) + r = install(name, node(Nnull_string, Node_var, (NODE *) NULL)); + else if (can_free) + free(name); + return r; +} + +static NODE * +mk_rexp(exp) +NODE *exp; +{ + if (exp->type == Node_regex) + return exp; + else { + NODE *n; + + getnode(n); + n->type = Node_regex; + n->re_exp = exp; + n->re_text = NULL; + n->re_reg = NULL; + n->re_flags = 0; + n->re_cnt = 1; + return n; + } +} diff --git a/gnu/usr.bin/awk/builtin.c b/gnu/usr.bin/awk/builtin.c new file mode 100644 index 000000000000..9d5e3b302fde --- /dev/null +++ b/gnu/usr.bin/awk/builtin.c @@ -0,0 +1,1133 @@ +/* + * builtin.c - Builtin functions and various utility procedures + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Programming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */ + + +#include "awk.h" + + +#ifndef SRANDOM_PROTO +extern void srandom P((int seed)); +#endif +#ifndef linux +extern char *initstate P((unsigned seed, char *state, int n)); +extern char *setstate P((char *state)); +extern long random P((void)); +#endif + +extern NODE **fields_arr; +extern int output_is_tty; + +static NODE *sub_common P((NODE *tree, int global)); + +#ifdef GFMT_WORKAROUND +char *gfmt P((double g, int prec, char *buf)); +#endif + +#ifdef _CRAY +/* Work around a problem in conversion of doubles to exact integers. */ +#include <float.h> +#define Floor(n) floor((n) * (1.0 + DBL_EPSILON)) +#define Ceil(n) ceil((n) * (1.0 + DBL_EPSILON)) + +/* Force the standard C compiler to use the library math functions. */ +extern double exp(double); +double (*Exp)() = exp; +#define exp(x) (*Exp)(x) +extern double log(double); +double (*Log)() = log; +#define log(x) (*Log)(x) +#else +#define Floor(n) floor(n) +#define Ceil(n) ceil(n) +#endif + +static void +efwrite(ptr, size, count, fp, from, rp, flush) +void *ptr; +unsigned size, count; +FILE *fp; +char *from; +struct redirect *rp; +int flush; +{ + errno = 0; + if (fwrite(ptr, size, count, fp) != count) + goto wrerror; + if (flush + && ((fp == stdout && output_is_tty) + || (rp && (rp->flag & RED_NOBUF)))) { + fflush(fp); + if (ferror(fp)) + goto wrerror; + } + return; + + wrerror: + fatal("%s to \"%s\" failed (%s)", from, + rp ? rp->value : "standard output", + errno ? 
strerror(errno) : "reason unknown"); +} + +/* Builtin functions */ +NODE * +do_exp(tree) +NODE *tree; +{ + NODE *tmp; + double d, res; +#ifndef exp + double exp P((double)); +#endif + + tmp= tree_eval(tree->lnode); + d = force_number(tmp); + free_temp(tmp); + errno = 0; + res = exp(d); + if (errno == ERANGE) + warning("exp argument %g is out of range", d); + return tmp_number((AWKNUM) res); +} + +NODE * +do_index(tree) +NODE *tree; +{ + NODE *s1, *s2; + register char *p1, *p2; + register int l1, l2; + long ret; + + + s1 = tree_eval(tree->lnode); + s2 = tree_eval(tree->rnode->lnode); + force_string(s1); + force_string(s2); + p1 = s1->stptr; + p2 = s2->stptr; + l1 = s1->stlen; + l2 = s2->stlen; + ret = 0; + if (IGNORECASE) { + while (l1) { + if (l2 > l1) + break; + if (casetable[(int)*p1] == casetable[(int)*p2] + && (l2 == 1 || strncasecmp(p1, p2, l2) == 0)) { + ret = 1 + s1->stlen - l1; + break; + } + l1--; + p1++; + } + } else { + while (l1) { + if (l2 > l1) + break; + if (*p1 == *p2 + && (l2 == 1 || STREQN(p1, p2, l2))) { + ret = 1 + s1->stlen - l1; + break; + } + l1--; + p1++; + } + } + free_temp(s1); + free_temp(s2); + return tmp_number((AWKNUM) ret); +} + +NODE * +do_int(tree) +NODE *tree; +{ + NODE *tmp; + double floor P((double)); + double ceil P((double)); + double d; + + tmp = tree_eval(tree->lnode); + d = force_number(tmp); + if (d >= 0) + d = Floor(d); + else + d = Ceil(d); + free_temp(tmp); + return tmp_number((AWKNUM) d); +} + +NODE * +do_length(tree) +NODE *tree; +{ + NODE *tmp; + int len; + + tmp = tree_eval(tree->lnode); + len = force_string(tmp)->stlen; + free_temp(tmp); + return tmp_number((AWKNUM) len); +} + +NODE * +do_log(tree) +NODE *tree; +{ + NODE *tmp; +#ifndef log + double log P((double)); +#endif + double d, arg; + + tmp = tree_eval(tree->lnode); + arg = (double) force_number(tmp); + if (arg < 0.0) + warning("log called with negative argument %g", arg); + d = log(arg); + free_temp(tmp); + return tmp_number((AWKNUM) d); +} + +/* %e and %f 
formats are not properly implemented. Someone should fix them */ +/* Actually, this whole thing should be reimplemented. */ + +NODE * +do_sprintf(tree) +NODE *tree; +{ +#define bchunk(s,l) if(l) {\ + while((l)>ofre) {\ + erealloc(obuf, char *, osiz*2, "do_sprintf");\ + ofre+=osiz;\ + osiz*=2;\ + }\ + memcpy(obuf+olen,s,(l));\ + olen+=(l);\ + ofre-=(l);\ + } + + /* Is there space for something L big in the buffer? */ +#define chksize(l) if((l)>ofre) {\ + erealloc(obuf, char *, osiz*2, "do_sprintf");\ + ofre+=osiz;\ + osiz*=2;\ + } + + /* + * Get the next arg to be formatted. If we've run out of args, + * return "" (Null string) + */ +#define parse_next_arg() {\ + if(!carg) { toofew = 1; break; }\ + else {\ + arg=tree_eval(carg->lnode);\ + carg=carg->rnode;\ + }\ + } + + NODE *r; + int toofew = 0; + char *obuf; + int osiz, ofre, olen; + static char chbuf[] = "0123456789abcdef"; + static char sp[] = " "; + char *s0, *s1; + int n0; + NODE *sfmt, *arg; + register NODE *carg; + long fw, prec, lj, alt, big; + long *cur; + long val; +#ifdef sun386 /* Can't cast unsigned (int/long) from ptr->value */ + long tmp_uval; /* on 386i 4.0.1 C compiler -- it just hangs */ +#endif + unsigned long uval; + int sgn; + int base; + char cpbuf[30]; /* if we have numbers bigger than 30 */ + char *cend = &cpbuf[30];/* chars, we lose, but seems unlikely */ + char *cp; + char *fill; + double tmpval; + char *pr_str; + int ucasehex = 0; + char signchar = 0; + int len; + + + emalloc(obuf, char *, 120, "do_sprintf"); + osiz = 120; + ofre = osiz - 1; + olen = 0; + sfmt = tree_eval(tree->lnode); + sfmt = force_string(sfmt); + carg = tree->rnode; + for (s0 = s1 = sfmt->stptr, n0 = sfmt->stlen; n0-- > 0;) { + if (*s1 != '%') { + s1++; + continue; + } + bchunk(s0, s1 - s0); + s0 = s1; + cur = &fw; + fw = 0; + prec = 0; + lj = alt = big = 0; + fill = sp; + cp = cend; + s1++; + +retry: + --n0; + switch (*s1++) { + case '%': + bchunk("%", 1); + s0 = s1; + break; + + case '0': + if (fill != sp || lj) + 
goto lose; + if (cur == &fw) + fill = "0"; /* FALL through */ + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + if (cur == 0) + goto lose; + *cur = s1[-1] - '0'; + while (n0 > 0 && *s1 >= '0' && *s1 <= '9') { + --n0; + *cur = *cur * 10 + *s1++ - '0'; + } + goto retry; + case '*': + if (cur == 0) + goto lose; + parse_next_arg(); + *cur = force_number(arg); + free_temp(arg); + goto retry; + case ' ': /* print ' ' or '-' */ + case '+': /* print '+' or '-' */ + signchar = *(s1-1); + goto retry; + case '-': + if (lj || fill != sp) + goto lose; + lj++; + goto retry; + case '.': + if (cur != &fw) + goto lose; + cur = &prec; + goto retry; + case '#': + if (alt) + goto lose; + alt++; + goto retry; + case 'l': + if (big) + goto lose; + big++; + goto retry; + case 'c': + parse_next_arg(); + if (arg->flags & NUMBER) { +#ifdef sun386 + tmp_uval = arg->numbr; + uval= (unsigned long) tmp_uval; +#else + uval = (unsigned long) arg->numbr; +#endif + cpbuf[0] = uval; + prec = 1; + pr_str = cpbuf; + goto dopr_string; + } + if (!
prec) + prec = 1; + else if (prec > arg->stlen) + prec = arg->stlen; + pr_str = arg->stptr; + goto dopr_string; + case 's': + parse_next_arg(); + arg = force_string(arg); + if (!prec || prec > arg->stlen) + prec = arg->stlen; + pr_str = arg->stptr; + + dopr_string: + if (fw > prec && !lj) { + while (fw > prec) { + bchunk(sp, 1); + fw--; + } + } + bchunk(pr_str, (int) prec); + if (fw > prec) { + while (fw > prec) { + bchunk(sp, 1); + fw--; + } + } + s0 = s1; + free_temp(arg); + break; + case 'd': + case 'i': + parse_next_arg(); + val = (long) force_number(arg); + free_temp(arg); + if (val < 0) { + sgn = 1; + val = -val; + } else + sgn = 0; + do { + *--cp = '0' + val % 10; + val /= 10; + } while (val); + if (sgn) + *--cp = '-'; + else if (signchar) + *--cp = signchar; + if (prec > fw) + fw = prec; + prec = cend - cp; + if (fw > prec && !lj) { + if (fill != sp && (*cp == '-' || signchar)) { + bchunk(cp, 1); + cp++; + prec--; + fw--; + } + while (fw > prec) { + bchunk(fill, 1); + fw--; + } + } + bchunk(cp, (int) prec); + if (fw > prec) { + while (fw > prec) { + bchunk(fill, 1); + fw--; + } + } + s0 = s1; + break; + case 'u': + base = 10; + goto pr_unsigned; + case 'o': + base = 8; + goto pr_unsigned; + case 'X': + ucasehex = 1; + case 'x': + base = 16; + goto pr_unsigned; + pr_unsigned: + parse_next_arg(); + uval = (unsigned long) force_number(arg); + free_temp(arg); + do { + *--cp = chbuf[uval % base]; + if (ucasehex && isalpha(*cp)) + *cp = toupper(*cp); + uval /= base; + } while (uval); + if (alt && (base == 8 || base == 16)) { + if (base == 16) { + if (ucasehex) + *--cp = 'X'; + else + *--cp = 'x'; + } + *--cp = '0'; + } + prec = cend - cp; + if (fw > prec && !lj) { + while (fw > prec) { + bchunk(fill, 1); + fw--; + } + } + bchunk(cp, (int) prec); + if (fw > prec) { + while (fw > prec) { + bchunk(fill, 1); + fw--; + } + } + s0 = s1; + break; + case 'g': + parse_next_arg(); + tmpval = force_number(arg); + free_temp(arg); + chksize(fw + prec + 9); /* 9==slop */ + + 
cp = cpbuf; + *cp++ = '%'; + if (lj) + *cp++ = '-'; + if (fill != sp) + *cp++ = '0'; +#ifndef GFMT_WORKAROUND + if (cur != &fw) { + (void) strcpy(cp, "*.*g"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (int) prec, (double) tmpval); + } else { + (void) strcpy(cp, "*g"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (double) tmpval); + } +#else /* GFMT_WORKAROUND */ + { + char *gptr, gbuf[120]; +#define DEFAULT_G_PRECISION 6 + if (fw + prec + 9 > sizeof gbuf) { /* 9==slop */ + emalloc(gptr, char *, fw+prec+9, "do_sprintf(gfmt)"); + } else + gptr = gbuf; + (void) gfmt((double) tmpval, cur != &fw ? + (int) prec : DEFAULT_G_PRECISION, gptr); + *cp++ = '*', *cp++ = 's', *cp = '\0'; + (void) sprintf(obuf + olen, cpbuf, (int) fw, gptr); + if (fill != sp && *gptr == ' ') { + char *p = gptr; + do { *p++ = '0'; } while (*p == ' '); + } + if (gptr != gbuf) free(gptr); + } +#endif /* GFMT_WORKAROUND */ + len = strlen(obuf + olen); + ofre -= len; + olen += len; + s0 = s1; + break; + + case 'f': + parse_next_arg(); + tmpval = force_number(arg); + free_temp(arg); + chksize(fw + prec + 9); /* 9==slop */ + + cp = cpbuf; + *cp++ = '%'; + if (lj) + *cp++ = '-'; + if (fill != sp) + *cp++ = '0'; + if (cur != &fw) { + (void) strcpy(cp, "*.*f"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (int) prec, (double) tmpval); + } else { + (void) strcpy(cp, "*f"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (double) tmpval); + } + len = strlen(obuf + olen); + ofre -= len; + olen += len; + s0 = s1; + break; + case 'e': + parse_next_arg(); + tmpval = force_number(arg); + free_temp(arg); + chksize(fw + prec + 9); /* 9==slop */ + cp = cpbuf; + *cp++ = '%'; + if (lj) + *cp++ = '-'; + if (fill != sp) + *cp++ = '0'; + if (cur != &fw) { + (void) strcpy(cp, "*.*e"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (int) prec, (double) tmpval); + } else { + (void) strcpy(cp, "*e"); + (void) sprintf(obuf + olen, cpbuf, (int) fw, (double) tmpval); + } + len = strlen(obuf + olen); + ofre -= len; + 
olen += len; + s0 = s1; + break; + + default: + lose: + break; + } + if (toofew) + fatal("%s\n\t%s\n\t%*s%s", + "not enough arguments to satisfy format string", + sfmt->stptr, s1 - sfmt->stptr - 2, "", + "^ ran out for this one" + ); + } + if (do_lint && carg != NULL) + warning("too many arguments supplied for format string"); + bchunk(s0, s1 - s0); + free_temp(sfmt); + r = make_str_node(obuf, olen, ALREADY_MALLOCED); + r->flags |= TEMP; + return r; +} + +void +do_printf(tree) +register NODE *tree; +{ + struct redirect *rp = NULL; + register FILE *fp; + + if (tree->rnode) { + int errflg; /* not used, sigh */ + + rp = redirect(tree->rnode, &errflg); + if (rp) { + fp = rp->fp; + if (!fp) + return; + } else + return; + } else + fp = stdout; + tree = do_sprintf(tree->lnode); + efwrite(tree->stptr, sizeof(char), tree->stlen, fp, "printf", rp , 1); + free_temp(tree); +} + +NODE * +do_sqrt(tree) +NODE *tree; +{ + NODE *tmp; + double arg; + extern double sqrt P((double)); + + tmp = tree_eval(tree->lnode); + arg = (double) force_number(tmp); + free_temp(tmp); + if (arg < 0.0) + warning("sqrt called with negative argument %g", arg); + return tmp_number((AWKNUM) sqrt(arg)); +} + +NODE * +do_substr(tree) +NODE *tree; +{ + NODE *t1, *t2, *t3; + NODE *r; + register int indx; + size_t length; + + t1 = tree_eval(tree->lnode); + t2 = tree_eval(tree->rnode->lnode); + if (tree->rnode->rnode == NULL) /* third arg. 
missing */ + length = t1->stlen; + else { + t3 = tree_eval(tree->rnode->rnode->lnode); + length = (size_t) force_number(t3); + free_temp(t3); + } + indx = (int) force_number(t2) - 1; + free_temp(t2); + t1 = force_string(t1); + if (indx < 0) + indx = 0; + if (indx >= t1->stlen || length <= 0) { + free_temp(t1); + return Nnull_string; + } + if (indx + length > t1->stlen || LONG_MAX - indx < length) + length = t1->stlen - indx; + r = tmp_string(t1->stptr + indx, length); + free_temp(t1); + return r; +} + +NODE * +do_strftime(tree) +NODE *tree; +{ + NODE *t1, *t2; + struct tm *tm; + time_t fclock; + char buf[100]; + int ret; + + t1 = force_string(tree_eval(tree->lnode)); + + if (tree->rnode == NULL) /* second arg. missing, default */ + (void) time(&fclock); + else { + t2 = tree_eval(tree->rnode->lnode); + fclock = (time_t) force_number(t2); + free_temp(t2); + } + tm = localtime(&fclock); + + ret = strftime(buf, 100, t1->stptr, tm); + + return tmp_string(buf, ret); +} + +NODE * +do_systime(tree) +NODE *tree; +{ + time_t lclock; + + (void) time(&lclock); + return tmp_number((AWKNUM) lclock); +} + +NODE * +do_system(tree) +NODE *tree; +{ + NODE *tmp; + int ret = 0; + char *cmd; + + (void) flush_io (); /* so output is synchronous with gawk's */ + tmp = tree_eval(tree->lnode); + cmd = force_string(tmp)->stptr; + if (cmd && *cmd) { + ret = system(cmd); + ret = (ret >> 8) & 0xff; + } + free_temp(tmp); + return tmp_number((AWKNUM) ret); +} + +void +do_print(tree) +register NODE *tree; +{ + register NODE *t1; + struct redirect *rp = NULL; + register FILE *fp; + register char *s; + + if (tree->rnode) { + int errflg; /* not used, sigh */ + + rp = redirect(tree->rnode, &errflg); + if (rp) { + fp = rp->fp; + if (!fp) + return; + } else + return; + } else + fp = stdout; + tree = tree->lnode; + while (tree) { + t1 = tree_eval(tree->lnode); + if (t1->flags & NUMBER) { + if (OFMTidx == CONVFMTidx) + (void) force_string(t1); + else { + char buf[100]; + + sprintf(buf, OFMT, t1->numbr); + 
t1 = tmp_string(buf, strlen(buf)); + } + } + efwrite(t1->stptr, sizeof(char), t1->stlen, fp, "print", rp, 0); + free_temp(t1); + tree = tree->rnode; + if (tree) { + s = OFS; + if (OFSlen) + efwrite(s, sizeof(char), OFSlen, fp, "print", rp, 0); + } + } + s = ORS; + if (ORSlen) + efwrite(s, sizeof(char), ORSlen, fp, "print", rp, 1); +} + +NODE * +do_tolower(tree) +NODE *tree; +{ + NODE *t1, *t2; + register char *cp, *cp2; + + t1 = tree_eval(tree->lnode); + t1 = force_string(t1); + t2 = tmp_string(t1->stptr, t1->stlen); + for (cp = t2->stptr, cp2 = t2->stptr + t2->stlen; cp < cp2; cp++) + if (isupper(*cp)) + *cp = tolower(*cp); + free_temp(t1); + return t2; +} + +NODE * +do_toupper(tree) +NODE *tree; +{ + NODE *t1, *t2; + register char *cp; + + t1 = tree_eval(tree->lnode); + t1 = force_string(t1); + t2 = tmp_string(t1->stptr, t1->stlen); + for (cp = t2->stptr; cp < t2->stptr + t2->stlen; cp++) + if (islower(*cp)) + *cp = toupper(*cp); + free_temp(t1); + return t2; +} + +NODE * +do_atan2(tree) +NODE *tree; +{ + NODE *t1, *t2; + extern double atan2 P((double, double)); + double d1, d2; + + t1 = tree_eval(tree->lnode); + t2 = tree_eval(tree->rnode->lnode); + d1 = force_number(t1); + d2 = force_number(t2); + free_temp(t1); + free_temp(t2); + return tmp_number((AWKNUM) atan2(d1, d2)); +} + +NODE * +do_sin(tree) +NODE *tree; +{ + NODE *tmp; + extern double sin P((double)); + double d; + + tmp = tree_eval(tree->lnode); + d = sin((double)force_number(tmp)); + free_temp(tmp); + return tmp_number((AWKNUM) d); +} + +NODE * +do_cos(tree) +NODE *tree; +{ + NODE *tmp; + extern double cos P((double)); + double d; + + tmp = tree_eval(tree->lnode); + d = cos((double)force_number(tmp)); + free_temp(tmp); + return tmp_number((AWKNUM) d); +} + +static int firstrand = 1; +static char state[256]; + +/* ARGSUSED */ +NODE * +do_rand(tree) +NODE *tree; +{ + if (firstrand) { + (void) initstate((unsigned) 1, state, sizeof state); + srandom(1); + firstrand = 0; + } + return tmp_number((AWKNUM) 
random() / LONG_MAX); +} + +NODE * +do_srand(tree) +NODE *tree; +{ + NODE *tmp; + static long save_seed = 0; + long ret = save_seed; /* SVR4 awk srand returns previous seed */ + + if (firstrand) + (void) initstate((unsigned) 1, state, sizeof state); + else + (void) setstate(state); + + if (!tree) + srandom((int) (save_seed = (long) time((time_t *) 0))); + else { + tmp = tree_eval(tree->lnode); + srandom((int) (save_seed = (long) force_number(tmp))); + free_temp(tmp); + } + firstrand = 0; + return tmp_number((AWKNUM) ret); +} + +NODE * +do_match(tree) +NODE *tree; +{ + NODE *t1; + int rstart; + AWKNUM rlength; + Regexp *rp; + + t1 = force_string(tree_eval(tree->lnode)); + tree = tree->rnode->lnode; + rp = re_update(tree); + rstart = research(rp, t1->stptr, 0, t1->stlen, 1); + if (rstart >= 0) { /* match succeded */ + rstart++; /* 1-based indexing */ + rlength = REEND(rp, t1->stptr) - RESTART(rp, t1->stptr); + } else { /* match failed */ + rstart = 0; + rlength = -1.0; + } + free_temp(t1); + unref(RSTART_node->var_value); + RSTART_node->var_value = make_number((AWKNUM) rstart); + unref(RLENGTH_node->var_value); + RLENGTH_node->var_value = make_number(rlength); + return tmp_number((AWKNUM) rstart); +} + +static NODE * +sub_common(tree, global) +NODE *tree; +int global; +{ + register char *scan; + register char *bp, *cp; + char *buf; + int buflen; + register char *matchend; + register int len; + char *matchstart; + char *text; + int textlen; + char *repl; + char *replend; + int repllen; + int sofar; + int ampersands; + int matches = 0; + Regexp *rp; + NODE *s; /* subst. pattern */ + NODE *t; /* string to make sub. 
in; $0 if none given */ + NODE *tmp; + NODE **lhs = &tree; /* value not used -- just different from NULL */ + int priv = 0; + Func_ptr after_assign = NULL; + + tmp = tree->lnode; + rp = re_update(tmp); + + tree = tree->rnode; + s = tree->lnode; + + tree = tree->rnode; + tmp = tree->lnode; + t = force_string(tree_eval(tmp)); + + /* do the search early to avoid work on non-match */ + if (research(rp, t->stptr, 0, t->stlen, 1) == -1 || + (RESTART(rp, t->stptr) > t->stlen) && (matches = 1)) { + free_temp(t); + return tmp_number((AWKNUM) matches); + } + + if (tmp->type == Node_val) + lhs = NULL; + else + lhs = get_lhs(tmp, &after_assign); + t->flags |= STRING; + /* + * create a private copy of the string + */ + if (t->stref > 1 || (t->flags & PERM)) { + unsigned int saveflags; + + saveflags = t->flags; + t->flags &= ~MALLOC; + tmp = dupnode(t); + t->flags = saveflags; + t = tmp; + priv = 1; + } + text = t->stptr; + textlen = t->stlen; + buflen = textlen + 2; + + s = force_string(tree_eval(s)); + repl = s->stptr; + replend = repl + s->stlen; + repllen = replend - repl; + emalloc(buf, char *, buflen, "do_sub"); + ampersands = 0; + for (scan = repl; scan < replend; scan++) { + if (*scan == '&') { + repllen--; + ampersands++; + } else if (*scan == '\\' && (*(scan+1) == '&' || *(scan+1) == '\\')) { + repllen--; + scan++; + } + } + + bp = buf; + for (;;) { + matches++; + matchstart = t->stptr + RESTART(rp, t->stptr); + matchend = t->stptr + REEND(rp, t->stptr); + + /* + * create the result, copying in parts of the original + * string + */ + len = matchstart - text + repllen + + ampersands * (matchend - matchstart); + sofar = bp - buf; + while (buflen - sofar - len - 1 < 0) { + buflen *= 2; + erealloc(buf, char *, buflen, "do_sub"); + bp = buf + sofar; + } + for (scan = text; scan < matchstart; scan++) + *bp++ = *scan; + for (scan = repl; scan < replend; scan++) + if (*scan == '&') + for (cp = matchstart; cp < matchend; cp++) + *bp++ = *cp; + else if (*scan == '\\' && 
(*(scan+1) == '&' || *(scan+1) == '\\')) { + scan++; + *bp++ = *scan; + } else + *bp++ = *scan; + if (global && matchstart == matchend && matchend < text + textlen) { + *bp++ = *matchend; + matchend++; + } + textlen = text + textlen - matchend; + text = matchend; + if (!global || textlen <= 0 || + research(rp, t->stptr, text-t->stptr, textlen, 1) == -1) + break; + } + sofar = bp - buf; + if (buflen - sofar - textlen - 1) { + buflen = sofar + textlen + 2; + erealloc(buf, char *, buflen, "do_sub"); + bp = buf + sofar; + } + for (scan = matchend; scan < text + textlen; scan++) + *bp++ = *scan; + textlen = bp - buf; + free(t->stptr); + t->stptr = buf; + t->stlen = textlen; + + free_temp(s); + if (matches > 0 && lhs) { + if (priv) { + unref(*lhs); + *lhs = t; + } + if (after_assign) + (*after_assign)(); + t->flags &= ~(NUM|NUMBER); + } + return tmp_number((AWKNUM) matches); +} + +NODE * +do_gsub(tree) +NODE *tree; +{ + return sub_common(tree, 1); +} + +NODE * +do_sub(tree) +NODE *tree; +{ + return sub_common(tree, 0); +} + +#ifdef GFMT_WORKAROUND + /* + * printf's %g format [can't rely on gcvt()] + * caveat: don't use as argument to *printf()! 
+ */ +char * +gfmt(g, prec, buf) +double g; /* value to format */ +int prec; /* indicates desired significant digits, not decimal places */ +char *buf; /* return buffer; assumed big enough to hold result */ +{ + if (g == 0.0) { + (void) strcpy(buf, "0"); /* easy special case */ + } else { + register char *d, *e, *p; + + /* start with 'e' format (it'll provide nice exponent) */ + if (prec < 1) prec = 1; /* at least 1 significant digit */ + (void) sprintf(buf, "%.*e", prec - 1, g); + if ((e = strchr(buf, 'e')) != 0) { /* find exponent */ + int exp = atoi(e+1); /* fetch exponent */ + if (exp >= -4 && exp < prec) { /* per K&R2, B1.2 */ + /* switch to 'f' format and re-do */ + prec -= (exp + 1); /* decimal precision */ + (void) sprintf(buf, "%.*f", prec, g); + e = buf + strlen(buf); + } + if ((d = strchr(buf, '.')) != 0) { + /* remove trailing zeroes and decimal point */ + for (p = e; p > d && *--p == '0'; ) continue; + if (*p == '.') --p; + if (++p < e) /* copy exponent and NUL */ + while ((*p++ = *e++) != '\0') continue; + } + } + } + return buf; +} +#endif /* GFMT_WORKAROUND */ diff --git a/gnu/usr.bin/awk/config.h b/gnu/usr.bin/awk/config.h new file mode 100644 index 000000000000..8c20953ed531 --- /dev/null +++ b/gnu/usr.bin/awk/config.h @@ -0,0 +1,272 @@ +/* + * config.h -- configuration definitions for gawk. + * + * For generic 4.4 alpha + */ + +/* + * Copyright (C) 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +/* + * This file isolates configuration dependencies for gnu awk. + * You should know something about your system, perhaps by having + * a manual handy, when you edit this file. You should copy config.h-dist + * to config.h, and edit config.h. Do not modify config.h-dist, so that + * it will be easy to apply any patches that may be distributed. + * + * The general idea is that systems conforming to the various standards + * should need to do the least amount of changing. Definining the various + * items in ths file usually means that your system is missing that + * particular feature. + * + * The order of preference in standard conformance is ANSI C, POSIX, + * and the SVID. + * + * If you have no clue as to what's going on with your system, try + * compiling gawk without editing this file and see what shows up + * missing in the link stage. From there, you can probably figure out + * which defines to turn on. + */ + +/**************************/ +/* Miscellanious features */ +/**************************/ + +/* + * BLKSIZE_MISSING + * + * Check your /usr/include/sys/stat.h file. If the stat structure + * does not have a member named st_blksize, define this. (This will + * most likely be the case on most System V systems prior to V.4.) + */ +/* #define BLKSIZE_MISSING 1 */ + +/* + * SIGTYPE + * + * The return type of the routines passed to the signal function. + * Modern systems use `void', older systems use `int'. + * If left undefined, it will default to void. 
+ */ +/* #define SIGTYPE int */ + +/* + * SIZE_T_MISSING + * + * If your system has no typedef for size_t, define this to get a default + */ +/* #define SIZE_T_MISSING 1 */ + +/* + * CHAR_UNSIGNED + * + * If your machine uses unsigned characters (IBM RT and RS/6000 and others) + * then define this for use in regex.c + */ +/* #define CHAR_UNSIGNED 1 */ + +/* + * HAVE_UNDERSCORE_SETJMP + * + * Check in your /usr/include/setjmp.h file. If there are routines + * there named _setjmp and _longjmp, then you should define this. + * Typically only systems derived from Berkeley Unix have this. + */ +#define HAVE_UNDERSCORE_SETJMP 1 + +/***********************************************/ +/* Missing library subroutines or system calls */ +/***********************************************/ + +/* + * MEMCMP_MISSING + * MEMCPY_MISSING + * MEMSET_MISSING + * + * These three routines are for manipulating blocks of memory. Most + * likely they will either all three be present or all three be missing, + * so they're grouped together. + */ +/* #define MEMCMP_MISSING 1 */ +/* #define MEMCPY_MISSING 1 */ +/* #define MEMSET_MISSING 1 */ + +/* + * RANDOM_MISSING + * + * Your system does not have the random(3) suite of random number + * generating routines. These are different than the old rand(3) + * routines! + */ +/* #define RANDOM_MISSING 1 */ + +/* + * STRCASE_MISSING + * + * Your system does not have the strcasemp() and strncasecmp() + * routines that originated in Berkeley Unix. + */ +/* #define STRCASE_MISSING 1 */ + +/* + * STRCHR_MISSING + * + * Your system does not have the strchr() and strrchr() functions. + */ +/* #define STRCHR_MISSING 1 */ + +/* + * STRERROR_MISSING + * + * Your system lacks the ANSI C strerror() routine for returning the + * strings associated with errno values. + */ +/* #define STRERROR_MISSING 1 */ + +/* + * STRTOD_MISSING + * + * Your system does not have the strtod() routine for converting + * strings to double precision floating point values. 
+ */ +/* #define STRTOD_MISSING 1 */ + +/* + * STRFTIME_MISSING + * + * Your system lacks the ANSI C strftime() routine for formatting + * broken down time values. + */ +/* #define STRFTIME_MISSING 1 */ + +/* + * TZSET_MISSING + * + * If you have a 4.2 BSD vintage system, then the strftime() routine + * supplied in the missing directory won't be enough, because it relies on the + * tzset() routine from System V / Posix. Fortunately, there is an + * emulation for tzset() too that should do the trick. If you don't + * have tzset(), define this. + */ +/* #define TZSET_MISSING 1 */ + +/* + * TZNAME_MISSING + * + * Some systems do not support the external variables tzname and daylight. + * If this is the case *and* strftime() is missing, define this. + */ +/* #define TZNAME_MISSING 1 */ + +/* + * STDC_HEADERS + * + * If your system does have ANSI compliant header files that + * provide prototypes for library routines, then define this. + */ +#define STDC_HEADERS 1 + +/* + * NO_TOKEN_PASTING + * + * If your compiler define's __STDC__ but does not support token + * pasting (tok##tok), then define this. + */ +/* #define NO_TOKEN_PASTING 1 */ + +/*****************************************************************/ +/* Stuff related to the Standard I/O Library. */ +/*****************************************************************/ +/* Much of this is (still, unfortunately) black magic in nature. */ +/* You may have to use some or all of these together to get gawk */ +/* to work correctly. */ +/*****************************************************************/ + +/* + * NON_STD_SPRINTF + * + * Look in your /usr/include/stdio.h file. If the return type of the + * sprintf() function is NOT `int', define this. + */ +/* #define NON_STD_SPRINTF 1 */ + +/* + * VPRINTF_MISSING + * + * Define this if your system lacks vprintf() and the other routines + * that go with it. This will trigger an attempt to use _doprnt(). 
+ * If you don't have that, this attempt will fail and you are on your own. + */ +/* #define VPRINTF_MISSING 1 */ + +/* + * Casts from size_t to int and back. These will become unnecessary + * at some point in the future, but for now are required where the + * two types are a different representation. + */ +/* #define SZTC */ +/* #define INTC */ + +/* + * SYSTEM_MISSING + * + * Define this if your library does not provide a system function + * or you are not entirely happy with it and would rather use + * a provided replacement (atari only). + */ +/* #define SYSTEM_MISSING 1 */ + +/* + * FMOD_MISSING + * + * Define this if your system lacks the fmod() function and modf() will + * be used instead. + */ +/* #define FMOD_MISSING 1 */ + + +/*******************************/ +/* Gawk configuration options. */ +/*******************************/ + +/* + * DEFPATH + * + * The default search path for the -f option of gawk. It is used + * if the AWKPATH environment variable is undefined. The default + * definition is provided here. Most likely you should not change + * this. + */ + +/* #define DEFPATH ".:/usr/lib/awk:/usr/local/lib/awk" */ +/* #define ENVSEP ':' */ + +/* + * alloca already has a prototype defined - don't redefine it + */ +#define ALLOCA_PROTO 1 + +/* + * srandom already has a prototype defined - don't redefine it + */ +#define SRANDOM_PROTO 1 + +/* anything that follows is for system-specific short-term kludges */ diff --git a/gnu/usr.bin/awk/dfa.c b/gnu/usr.bin/awk/dfa.c new file mode 100644 index 000000000000..5293c755871d --- /dev/null +++ b/gnu/usr.bin/awk/dfa.c @@ -0,0 +1,2291 @@ +/* dfa.c - determinisitic extended regexp routines for GNU + Copyright (C) 1988 Free Software Foundation, Inc. + Written June, 1988 by Mike Haertel + Modified July, 1988 by Arthur David Olson + to assist BMG speedups + + NO WARRANTY + + BECAUSE THIS PROGRAM IS LICENSED FREE OF CHARGE, WE PROVIDE ABSOLUTELY +NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE LAW. 
EXCEPT +WHEN OTHERWISE STATED IN WRITING, FREE SOFTWARE FOUNDATION, INC, +RICHARD M. STALLMAN AND/OR OTHER PARTIES PROVIDE THIS PROGRAM "AS IS" +WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, +BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY +AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE +DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR +CORRECTION. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL RICHARD M. +STALLMAN, THE FREE SOFTWARE FOUNDATION, INC., AND/OR ANY OTHER PARTY +WHO MAY MODIFY AND REDISTRIBUTE THIS PROGRAM AS PERMITTED BELOW, BE +LIABLE TO YOU FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST MONIES, OR +OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR +DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD PARTIES OR +A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS) THIS +PROGRAM, EVEN IF YOU HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY. + + GENERAL PUBLIC LICENSE TO COPY + + 1. You may copy and distribute verbatim copies of this source file +as you receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy a valid copyright notice "Copyright + (C) 1988 Free Software Foundation, Inc."; and include following the +copyright notice a verbatim copy of the above disclaimer of warranty +and of this License. You may charge a distribution fee for the +physical act of transferring a copy. + + 2. 
You may modify your copy or copies of this source file or +any portion of it, and copy and distribute such modifications under +the terms of Paragraph 1 above, provided that you also do the following: + + a) cause the modified files to carry prominent notices stating + that you changed the files and the date of any change; and + + b) cause the whole of any work that you distribute or publish, + that in whole or in part contains or is a derivative of this + program or any part thereof, to be licensed at no charge to all + third parties on terms identical to those contained in this + License Agreement (except that you may choose to grant more extensive + warranty protection to some or all third parties, at your option). + + c) You may charge a distribution fee for the physical act of + transferring a copy, and you may at your option offer warranty + protection in exchange for a fee. + +Mere aggregation of another unrelated program with this program (or its +derivative) on a volume of a storage or distribution medium does not bring +the other program under the scope of these terms. + + 3. You may copy and distribute this program or any portion of it in +compiled, executable or object code form under the terms of Paragraphs +1 and 2 above provided that you do the following: + + a) accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of + Paragraphs 1 and 2 above; or, + + b) accompany it with a written offer, valid for at least three + years, to give any third party free (except for a nominal + shipping charge) a complete machine-readable copy of the + corresponding source code, to be distributed under the terms of + Paragraphs 1 and 2 above; or, + + c) accompany it with the information you received as to where the + corresponding source code may be obtained. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form alone.) 
+ +For an executable file, complete source code means all the source code for +all modules it contains; but, as a special exception, it need not include +source code for modules which are standard libraries that accompany the +operating system on which the executable file runs. + + 4. You may not copy, sublicense, distribute or transfer this program +except as expressly provided under this License Agreement. Any attempt +otherwise to copy, sublicense, distribute or transfer this program is void and +your rights to use the program under this License agreement shall be +automatically terminated. However, parties who have received computer +software programs from you with this License Agreement will not have +their licenses terminated so long as such parties remain in full compliance. + + 5. If you wish to incorporate parts of this program into other free +programs whose distribution conditions are different, write to the Free +Software Foundation at 675 Mass Ave, Cambridge, MA 02139. We have not yet +worked out a simple rule that can be stated here, but we will often permit +this. We will be guided by the two goals of preserving the free status of +all derivatives our free software and of promoting the sharing and reuse of +software. + + +In other words, you are welcome to use, share and improve this program. +You are forbidden to forbid anyone else to use, share and improve +what you give them. Help stamp out software-hoarding! 
*/ + +#include "awk.h" +#include <assert.h> + +#ifdef setbit /* surprise - setbit and clrbit are macros on NeXT */ +#undef setbit +#endif +#ifdef clrbit +#undef clrbit +#endif + +#ifdef __STDC__ +typedef void *ptr_t; +#else +typedef char *ptr_t; +#endif + +typedef struct { + char ** in; + char * left; + char * right; + char * is; +} must; + +static ptr_t xcalloc P((int n, size_t s)); +static ptr_t xmalloc P((size_t n)); +static ptr_t xrealloc P((ptr_t p, size_t n)); +static int tstbit P((int b, _charset c)); +static void setbit P((int b, _charset c)); +static void clrbit P((int b, _charset c)); +static void copyset P((const _charset src, _charset dst)); +static void zeroset P((_charset s)); +static void notset P((_charset s)); +static int equal P((const _charset s1, const _charset s2)); +static int charset_index P((const _charset s)); +static _token lex P((void)); +static void addtok P((_token t)); +static void atom P((void)); +static void closure P((void)); +static void branch P((void)); +static void regexp P((void)); +static void copy P((const _position_set *src, _position_set *dst)); +static void insert P((_position p, _position_set *s)); +static void merge P((_position_set *s1, _position_set *s2, _position_set *m)); +static void delete P((_position p, _position_set *s)); +static int state_index P((struct regexp *r, _position_set *s, + int newline, int letter)); +static void epsclosure P((_position_set *s, struct regexp *r)); +static void build_state P((int s, struct regexp *r)); +static void build_state_zero P((struct regexp *r)); +static char *icatalloc P((char *old, const char *new)); +static char *icpyalloc P((const char *string)); +static char *istrstr P((char *lookin, char *lookfor)); +static void ifree P((char *cp)); +static void freelist P((char **cpp)); +static char **enlist P((char **cpp, char *new, size_t len)); +static char **comsubs P((char *left, char *right)); +static char **addlists P((char **old, char **new)); +static char **inboth P((char 
**left, char **right)); +static void resetmust P((must *mp)); +static void regmust P((struct regexp *r)); + +#undef P + +static ptr_t +xcalloc(n, s) + int n; + size_t s; +{ + ptr_t r = calloc(n, s); + + if (NULL == r) + reg_error("Memory exhausted"); /* reg_error does not return */ + return r; +} + +static ptr_t +xmalloc(n) + size_t n; +{ + ptr_t r = malloc(n); + + assert(n != 0); + if (NULL == r) + reg_error("Memory exhausted"); + return r; +} + +static ptr_t +xrealloc(p, n) + ptr_t p; + size_t n; +{ + ptr_t r = realloc(p, n); + + assert(n != 0); + if (NULL == r) + reg_error("Memory exhausted"); + return r; +} + +#define CALLOC(p, t, n) ((p) = (t *) xcalloc((n), sizeof (t))) +#undef MALLOC +#define MALLOC(p, t, n) ((p) = (t *) xmalloc((n) * sizeof (t))) +#define REALLOC(p, t, n) ((p) = (t *) xrealloc((ptr_t) (p), (n) * sizeof (t))) + +/* Reallocate an array of type t if nalloc is too small for index. */ +#define REALLOC_IF_NECESSARY(p, t, nalloc, index) \ + if ((index) >= (nalloc)) \ + { \ + while ((index) >= (nalloc)) \ + (nalloc) *= 2; \ + REALLOC(p, t, nalloc); \ + } + +/* Stuff pertaining to charsets. 
*/ + +static int +tstbit(b, c) + int b; + _charset c; +{ + return c[b / INTBITS] & 1 << b % INTBITS; +} + +static void +setbit(b, c) + int b; + _charset c; +{ + c[b / INTBITS] |= 1 << b % INTBITS; +} + +static void +clrbit(b, c) + int b; + _charset c; +{ + c[b / INTBITS] &= ~(1 << b % INTBITS); +} + +static void +copyset(src, dst) + const _charset src; + _charset dst; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + dst[i] = src[i]; +} + +static void +zeroset(s) + _charset s; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + s[i] = 0; +} + +static void +notset(s) + _charset s; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + s[i] = ~s[i]; +} + +static int +equal(s1, s2) + const _charset s1; + const _charset s2; +{ + int i; + + for (i = 0; i < _CHARSET_INTS; ++i) + if (s1[i] != s2[i]) + return 0; + return 1; +} + +/* A pointer to the current regexp is kept here during parsing. */ +static struct regexp *reg; + +/* Find the index of charset s in reg->charsets, or allocate a new charset. */ +static int +charset_index(s) + const _charset s; +{ + int i; + + for (i = 0; i < reg->cindex; ++i) + if (equal(s, reg->charsets[i])) + return i; + REALLOC_IF_NECESSARY(reg->charsets, _charset, reg->calloc, reg->cindex); + ++reg->cindex; + copyset(s, reg->charsets[i]); + return i; +} + +/* Syntax bits controlling the behavior of the lexical analyzer. */ +static syntax_bits, syntax_bits_set; + +/* Flag for case-folding letters into sets. */ +static case_fold; + +/* Entry point to set syntax options. */ +void +regsyntax(bits, fold) + long bits; + int fold; +{ + syntax_bits_set = 1; + syntax_bits = bits; + case_fold = fold; +} + +/* Lexical analyzer. */ +static const char *lexstart; /* Pointer to beginning of input string. */ +static const char *lexptr; /* Pointer to next input character. */ +static lexleft; /* Number of characters remaining. */ +static caret_allowed; /* True if backward context allows ^ + (meaningful only if RE_CONTEXT_INDEP_OPS + is turned off). 
*/ +static closure_allowed; /* True if backward context allows closures + (meaningful only if RE_CONTEXT_INDEP_OPS + is turned off). */ + +/* Note that characters become unsigned here. */ +#define FETCH(c, eoferr) \ + { \ + if (! lexleft) \ + if (eoferr != NULL) \ + reg_error(eoferr); \ + else \ + return _END; \ + (c) = (unsigned char) *lexptr++; \ + --lexleft; \ + } + +static _token +lex() +{ + _token c, c2; + int invert; + _charset cset; + + FETCH(c, (char *) 0); + switch (c) + { + case '^': + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) + && (!caret_allowed || + ((syntax_bits & RE_TIGHT_VBAR) && lexptr - 1 != lexstart))) + goto normal_char; + caret_allowed = 0; + return syntax_bits & RE_TIGHT_VBAR ? _ALLBEGLINE : _BEGLINE; + + case '$': + if (syntax_bits & RE_CONTEXT_INDEP_OPS || !lexleft + || (! (syntax_bits & RE_TIGHT_VBAR) + && ((syntax_bits & RE_NO_BK_PARENS + ? lexleft > 0 && *lexptr == ')' + : lexleft > 1 && *lexptr == '\\' && lexptr[1] == ')') + || (syntax_bits & RE_NO_BK_VBAR + ? lexleft > 0 && *lexptr == '|' + : lexleft > 1 && *lexptr == '\\' && lexptr[1] == '|')))) + return syntax_bits & RE_TIGHT_VBAR ? 
_ALLENDLINE : _ENDLINE; + goto normal_char; + + case '\\': + FETCH(c, "Unfinished \\ quote"); + switch (c) + { + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + caret_allowed = 0; + closure_allowed = 1; + return _BACKREF; + + case '<': + caret_allowed = 0; + return _BEGWORD; + + case '>': + caret_allowed = 0; + return _ENDWORD; + + case 'b': + caret_allowed = 0; + return _LIMWORD; + + case 'B': + caret_allowed = 0; + return _NOTLIMWORD; + + case 'w': + case 'W': + zeroset(cset); + for (c2 = 0; c2 < _NOTCHAR; ++c2) + if (ISALNUM(c2)) + setbit(c2, cset); + if (c == 'W') + notset(cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + case '?': + if (syntax_bits & RE_BK_PLUS_QM) + goto qmark; + goto normal_char; + + case '+': + if (syntax_bits & RE_BK_PLUS_QM) + goto plus; + goto normal_char; + + case '|': + if (! (syntax_bits & RE_NO_BK_VBAR)) + goto or; + goto normal_char; + + case '(': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto lparen; + goto normal_char; + + case ')': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto rparen; + goto normal_char; + + default: + goto normal_char; + } + + case '?': + if (syntax_bits & RE_BK_PLUS_QM) + goto normal_char; + qmark: + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _QMARK; + + case '*': + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _STAR; + + case '+': + if (syntax_bits & RE_BK_PLUS_QM) + goto normal_char; + plus: + if (! (syntax_bits & RE_CONTEXT_INDEP_OPS) && !closure_allowed) + goto normal_char; + return _PLUS; + + case '|': + if (! (syntax_bits & RE_NO_BK_VBAR)) + goto normal_char; + or: + caret_allowed = 1; + closure_allowed = 0; + return _OR; + + case '\n': + if (! (syntax_bits & RE_NEWLINE_OR)) + goto normal_char; + goto or; + + case '(': + if (! 
(syntax_bits & RE_NO_BK_PARENS)) + goto normal_char; + lparen: + caret_allowed = 1; + closure_allowed = 0; + return _LPAREN; + + case ')': + if (! (syntax_bits & RE_NO_BK_PARENS)) + goto normal_char; + rparen: + caret_allowed = 0; + closure_allowed = 1; + return _RPAREN; + + case '.': + zeroset(cset); + notset(cset); + clrbit('\n', cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + case '[': + zeroset(cset); + FETCH(c, "Unbalanced ["); + if (c == '^') + { + FETCH(c, "Unbalanced ["); + invert = 1; + } + else + invert = 0; + do + { + FETCH(c2, "Unbalanced ["); + if ((syntax_bits & RE_AWK_CLASS_HACK) && c == '\\') + { + c = c2; + FETCH(c2, "Unbalanced ["); + } + if (c2 == '-') + { + FETCH(c2, "Unbalanced ["); + if (c2 == ']' && (syntax_bits & RE_AWK_CLASS_HACK)) + { + setbit(c, cset); + setbit('-', cset); + break; + } + while (c <= c2) + setbit(c++, cset); + FETCH(c, "Unbalanced ["); + } + else + { + setbit(c, cset); + c = c2; + } + } + while (c != ']'); + if (invert) + notset(cset); + caret_allowed = 0; + closure_allowed = 1; + return _SET + charset_index(cset); + + default: + normal_char: + caret_allowed = 0; + closure_allowed = 1; + if (case_fold && ISALPHA(c)) + { + zeroset(cset); + if (isupper(c)) + c = tolower(c); + setbit(c, cset); + setbit(toupper(c), cset); + return _SET + charset_index(cset); + } + return c; + } +} + +/* Recursive descent parser for regular expressions. */ + +static _token tok; /* Lookahead token. */ +static depth; /* Current depth of a hypothetical stack + holding deferred productions. This is + used to determine the depth that will be + required of the real stack later on in + reganalyze(). */ + +/* Add the given token to the parse tree, maintaining the depth count and + updating the maximum depth if necessary. 
*/ +static void +addtok(t) + _token t; +{ + REALLOC_IF_NECESSARY(reg->tokens, _token, reg->talloc, reg->tindex); + reg->tokens[reg->tindex++] = t; + + switch (t) + { + case _QMARK: + case _STAR: + case _PLUS: + break; + + case _CAT: + case _OR: + --depth; + break; + + default: + ++reg->nleaves; + case _EMPTY: + ++depth; + break; + } + if (depth > reg->depth) + reg->depth = depth; +} + +/* The grammar understood by the parser is as follows. + + start: + regexp + _ALLBEGLINE regexp + regexp _ALLENDLINE + _ALLBEGLINE regexp _ALLENDLINE + + regexp: + regexp _OR branch + branch + + branch: + branch closure + closure + + closure: + closure _QMARK + closure _STAR + closure _PLUS + atom + + atom: + <normal character> + _SET + _BACKREF + _BEGLINE + _ENDLINE + _BEGWORD + _ENDWORD + _LIMWORD + _NOTLIMWORD + <empty> + + The parser builds a parse tree in postfix form in an array of tokens. */ + +#ifdef __STDC__ +static void regexp(void); +#else +static void regexp(); +#endif + +static void +atom() +{ + if (tok >= 0 && (tok < _NOTCHAR || tok >= _SET || tok == _BACKREF + || tok == _BEGLINE || tok == _ENDLINE || tok == _BEGWORD + || tok == _ENDWORD || tok == _LIMWORD || tok == _NOTLIMWORD)) + { + addtok(tok); + tok = lex(); + } + else if (tok == _LPAREN) + { + tok = lex(); + regexp(); + if (tok != _RPAREN) + reg_error("Unbalanced ("); + tok = lex(); + } + else + addtok(_EMPTY); +} + +static void +closure() +{ + atom(); + while (tok == _QMARK || tok == _STAR || tok == _PLUS) + { + addtok(tok); + tok = lex(); + } +} + +static void +branch() +{ + closure(); + while (tok != _RPAREN && tok != _OR && tok != _ALLENDLINE && tok >= 0) + { + closure(); + addtok(_CAT); + } +} + +static void +regexp() +{ + branch(); + while (tok == _OR) + { + tok = lex(); + branch(); + addtok(_OR); + } +} + +/* Main entry point for the parser. S is a string to be parsed, len is the + length of the string, so s can include NUL characters. R is a pointer to + the struct regexp to parse into. 
*/ +void +regparse(s, len, r) + const char *s; + size_t len; + struct regexp *r; +{ + reg = r; + lexstart = lexptr = s; + lexleft = len; + caret_allowed = 1; + closure_allowed = 0; + + if (! syntax_bits_set) + reg_error("No syntax specified"); + + tok = lex(); + depth = r->depth; + + if (tok == _ALLBEGLINE) + { + addtok(_BEGLINE); + tok = lex(); + regexp(); + addtok(_CAT); + } + else + regexp(); + + if (tok == _ALLENDLINE) + { + addtok(_ENDLINE); + addtok(_CAT); + tok = lex(); + } + + if (tok != _END) + reg_error("Unbalanced )"); + + addtok(_END - r->nregexps); + addtok(_CAT); + + if (r->nregexps) + addtok(_OR); + + ++r->nregexps; +} + +/* Some primitives for operating on sets of positions. */ + +/* Copy one set to another; the destination must be large enough. */ +static void +copy(src, dst) + const _position_set *src; + _position_set *dst; +{ + int i; + + for (i = 0; i < src->nelem; ++i) + dst->elems[i] = src->elems[i]; + dst->nelem = src->nelem; +} + +/* Insert a position in a set. Position sets are maintained in sorted + order according to index. If position already exists in the set with + the same index then their constraints are logically or'd together. + S->elems must point to an array large enough to hold the resulting set. */ +static void +insert(p, s) + _position p; + _position_set *s; +{ + int i; + _position t1, t2; + + for (i = 0; i < s->nelem && p.index < s->elems[i].index; ++i) + ; + if (i < s->nelem && p.index == s->elems[i].index) + s->elems[i].constraint |= p.constraint; + else + { + t1 = p; + ++s->nelem; + while (i < s->nelem) + { + t2 = s->elems[i]; + s->elems[i++] = t1; + t1 = t2; + } + } +} + +/* Merge two sets of positions into a third. The result is exactly as if + the positions of both sets were inserted into an initially empty set. 
*/ +static void +merge(s1, s2, m) + _position_set *s1; + _position_set *s2; + _position_set *m; +{ + int i = 0, j = 0; + + m->nelem = 0; + while (i < s1->nelem && j < s2->nelem) + if (s1->elems[i].index > s2->elems[j].index) + m->elems[m->nelem++] = s1->elems[i++]; + else if (s1->elems[i].index < s2->elems[j].index) + m->elems[m->nelem++] = s2->elems[j++]; + else + { + m->elems[m->nelem] = s1->elems[i++]; + m->elems[m->nelem++].constraint |= s2->elems[j++].constraint; + } + while (i < s1->nelem) + m->elems[m->nelem++] = s1->elems[i++]; + while (j < s2->nelem) + m->elems[m->nelem++] = s2->elems[j++]; +} + +/* Delete a position from a set. */ +static void +delete(p, s) + _position p; + _position_set *s; +{ + int i; + + for (i = 0; i < s->nelem; ++i) + if (p.index == s->elems[i].index) + break; + if (i < s->nelem) + for (--s->nelem; i < s->nelem; ++i) + s->elems[i] = s->elems[i + 1]; +} + +/* Find the index of the state corresponding to the given position set with + the given preceding context, or create a new state if there is no such + state. Newline and letter tell whether we got here on a newline or + letter, respectively. */ +static int +state_index(r, s, newline, letter) + struct regexp *r; + _position_set *s; + int newline; + int letter; +{ + int lhash = 0; + int constraint; + int i, j; + + newline = newline ? 1 : 0; + letter = letter ? 1 : 0; + + for (i = 0; i < s->nelem; ++i) + lhash ^= s->elems[i].index + s->elems[i].constraint; + + /* Try to find a state that exactly matches the proposed one. */ + for (i = 0; i < r->sindex; ++i) + { + if (lhash != r->states[i].hash || s->nelem != r->states[i].elems.nelem + || newline != r->states[i].newline || letter != r->states[i].letter) + continue; + for (j = 0; j < s->nelem; ++j) + if (s->elems[j].constraint + != r->states[i].elems.elems[j].constraint + || s->elems[j].index != r->states[i].elems.elems[j].index) + break; + if (j == s->nelem) + return i; + } + + /* We'll have to create a new state. 
*/ + REALLOC_IF_NECESSARY(r->states, _dfa_state, r->salloc, r->sindex); + r->states[i].hash = lhash; + MALLOC(r->states[i].elems.elems, _position, s->nelem); + copy(s, &r->states[i].elems); + r->states[i].newline = newline; + r->states[i].letter = letter; + r->states[i].backref = 0; + r->states[i].constraint = 0; + r->states[i].first_end = 0; + for (j = 0; j < s->nelem; ++j) + if (r->tokens[s->elems[j].index] < 0) + { + constraint = s->elems[j].constraint; + if (_SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 0) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 0, letter, 1) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 0) + || _SUCCEEDS_IN_CONTEXT(constraint, newline, 1, letter, 1)) + r->states[i].constraint |= constraint; + if (! r->states[i].first_end) + r->states[i].first_end = r->tokens[s->elems[j].index]; + } + else if (r->tokens[s->elems[j].index] == _BACKREF) + { + r->states[i].constraint = _NO_CONSTRAINT; + r->states[i].backref = 1; + } + + ++r->sindex; + + return i; +} + +/* Find the epsilon closure of a set of positions. If any position of the set + contains a symbol that matches the empty string in some context, replace + that position with the elements of its follow labeled with an appropriate + constraint. Repeat exhaustively until no funny positions are left. + S->elems must be large enough to hold the result. 
*/ +static void +epsclosure(s, r) + _position_set *s; + struct regexp *r; +{ + int i, j; + int *visited; + _position p, old; + + MALLOC(visited, int, r->tindex); + for (i = 0; i < r->tindex; ++i) + visited[i] = 0; + + for (i = 0; i < s->nelem; ++i) + if (r->tokens[s->elems[i].index] >= _NOTCHAR + && r->tokens[s->elems[i].index] != _BACKREF + && r->tokens[s->elems[i].index] < _SET) + { + old = s->elems[i]; + p.constraint = old.constraint; + delete(s->elems[i], s); + if (visited[old.index]) + { + --i; + continue; + } + visited[old.index] = 1; + switch (r->tokens[old.index]) + { + case _BEGLINE: + p.constraint &= _BEGLINE_CONSTRAINT; + break; + case _ENDLINE: + p.constraint &= _ENDLINE_CONSTRAINT; + break; + case _BEGWORD: + p.constraint &= _BEGWORD_CONSTRAINT; + break; + case _ENDWORD: + p.constraint &= _ENDWORD_CONSTRAINT; + break; + case _LIMWORD: + p.constraint &= _LIMWORD_CONSTRAINT; + break; + case _NOTLIMWORD: + p.constraint &= _NOTLIMWORD_CONSTRAINT; + break; + default: + break; + } + for (j = 0; j < r->follows[old.index].nelem; ++j) + { + p.index = r->follows[old.index].elems[j].index; + insert(p, s); + } + /* Force rescan to start at the beginning. */ + i = -1; + } + + free(visited); +} + +/* Perform bottom-up analysis on the parse tree, computing various functions. + Note that at this point, we're pretending constructs like \< are real + characters rather than constraints on what can follow them. + + Nullable: A node is nullable if it is at the root of a regexp that can + match the empty string. + * _EMPTY leaves are nullable. + * No other leaf is nullable. + * A _QMARK or _STAR node is nullable. + * A _PLUS node is nullable if its argument is nullable. + * A _CAT node is nullable if both its arguments are nullable. + * An _OR node is nullable if either argument is nullable. + + Firstpos: The firstpos of a node is the set of positions (nonempty leaves) + that could correspond to the first character of a string matching the + regexp rooted at the given node. 
+ * _EMPTY leaves have empty firstpos. + * The firstpos of a nonempty leaf is that leaf itself. + * The firstpos of a _QMARK, _STAR, or _PLUS node is the firstpos of its + argument. + * The firstpos of a _CAT node is the firstpos of the left argument, union + the firstpos of the right if the left argument is nullable. + * The firstpos of an _OR node is the union of firstpos of each argument. + + Lastpos: The lastpos of a node is the set of positions that could + correspond to the last character of a string matching the regexp at + the given node. + * _EMPTY leaves have empty lastpos. + * The lastpos of a nonempty leaf is that leaf itself. + * The lastpos of a _QMARK, _STAR, or _PLUS node is the lastpos of its + argument. + * The lastpos of a _CAT node is the lastpos of its right argument, union + the lastpos of the left if the right argument is nullable. + * The lastpos of an _OR node is the union of the lastpos of each argument. + + Follow: The follow of a position is the set of positions that could + correspond to the character following a character matching the node in + a string matching the regexp. At this point we consider special symbols + that match the empty string in some context to be just normal characters. + Later, if we find that a special symbol is in a follow set, we will + replace it with the elements of its follow, labeled with an appropriate + constraint. + * Every node in the firstpos of the argument of a _STAR or _PLUS node is in + the follow of every node in the lastpos. + * Every node in the firstpos of the second argument of a _CAT node is in + the follow of every node in the lastpos of the first argument. + + Because of the postfix representation of the parse tree, the depth-first + analysis is conveniently done by a linear scan with the aid of a stack. 
+ Sets are stored as arrays of the elements, obeying a stack-like allocation + scheme; the number of elements in each set deeper in the stack can be + used to determine the address of a particular set's array. */ +void +reganalyze(r, searchflag) + struct regexp *r; + int searchflag; +{ + int *nullable; /* Nullable stack. */ + int *nfirstpos; /* Element count stack for firstpos sets. */ + _position *firstpos; /* Array where firstpos elements are stored. */ + int *nlastpos; /* Element count stack for lastpos sets. */ + _position *lastpos; /* Array where lastpos elements are stored. */ + int *nalloc; /* Sizes of arrays allocated to follow sets. */ + _position_set tmp; /* Temporary set for merging sets. */ + _position_set merged; /* Result of merging sets. */ + int wants_newline; /* True if some position wants newline info. */ + int *o_nullable; + int *o_nfirst, *o_nlast; + _position *o_firstpos, *o_lastpos; + int i, j; + _position *pos; + + r->searchflag = searchflag; + + MALLOC(nullable, int, r->depth); + o_nullable = nullable; + MALLOC(nfirstpos, int, r->depth); + o_nfirst = nfirstpos; + MALLOC(firstpos, _position, r->nleaves); + o_firstpos = firstpos, firstpos += r->nleaves; + MALLOC(nlastpos, int, r->depth); + o_nlast = nlastpos; + MALLOC(lastpos, _position, r->nleaves); + o_lastpos = lastpos, lastpos += r->nleaves; + MALLOC(nalloc, int, r->tindex); + for (i = 0; i < r->tindex; ++i) + nalloc[i] = 0; + MALLOC(merged.elems, _position, r->nleaves); + + CALLOC(r->follows, _position_set, r->tindex); + + for (i = 0; i < r->tindex; ++i) + switch (r->tokens[i]) + { + case _EMPTY: + /* The empty set is nullable. */ + *nullable++ = 1; + + /* The firstpos and lastpos of the empty leaf are both empty. */ + *nfirstpos++ = *nlastpos++ = 0; + break; + + case _STAR: + case _PLUS: + /* Every element in the firstpos of the argument is in the follow + of every element in the lastpos. 
*/ + tmp.nelem = nfirstpos[-1]; + tmp.elems = firstpos; + pos = lastpos; + for (j = 0; j < nlastpos[-1]; ++j) + { + merge(&tmp, &r->follows[pos[j].index], &merged); + REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position, + nalloc[pos[j].index], merged.nelem - 1); + copy(&merged, &r->follows[pos[j].index]); + } + + case _QMARK: + /* A _QMARK or _STAR node is automatically nullable. */ + if (r->tokens[i] != _PLUS) + nullable[-1] = 1; + break; + + case _CAT: + /* Every element in the firstpos of the second argument is in the + follow of every element in the lastpos of the first argument. */ + tmp.nelem = nfirstpos[-1]; + tmp.elems = firstpos; + pos = lastpos + nlastpos[-1]; + for (j = 0; j < nlastpos[-2]; ++j) + { + merge(&tmp, &r->follows[pos[j].index], &merged); + REALLOC_IF_NECESSARY(r->follows[pos[j].index].elems, _position, + nalloc[pos[j].index], merged.nelem - 1); + copy(&merged, &r->follows[pos[j].index]); + } + + /* The firstpos of a _CAT node is the firstpos of the first argument, + union that of the second argument if the first is nullable. */ + if (nullable[-2]) + nfirstpos[-2] += nfirstpos[-1]; + else + firstpos += nfirstpos[-1]; + --nfirstpos; + + /* The lastpos of a _CAT node is the lastpos of the second argument, + union that of the first argument if the second is nullable. */ + if (nullable[-1]) + nlastpos[-2] += nlastpos[-1]; + else + { + pos = lastpos + nlastpos[-2]; + for (j = nlastpos[-1] - 1; j >= 0; --j) + pos[j] = lastpos[j]; + lastpos += nlastpos[-2]; + nlastpos[-2] = nlastpos[-1]; + } + --nlastpos; + + /* A _CAT node is nullable if both arguments are nullable. */ + nullable[-2] = nullable[-1] && nullable[-2]; + --nullable; + break; + + case _OR: + /* The firstpos is the union of the firstpos of each argument. */ + nfirstpos[-2] += nfirstpos[-1]; + --nfirstpos; + + /* The lastpos is the union of the lastpos of each argument. 
*/ + nlastpos[-2] += nlastpos[-1]; + --nlastpos; + + /* An _OR node is nullable if either argument is nullable. */ + nullable[-2] = nullable[-1] || nullable[-2]; + --nullable; + break; + + default: + /* Anything else is a nonempty position. (Note that special + constructs like \< are treated as nonempty strings here; + an "epsilon closure" effectively makes them nullable later. + Backreferences have to get a real position so we can detect + transitions on them later, but they are nullable.) */ + *nullable++ = r->tokens[i] == _BACKREF; + + /* This position is in its own firstpos and lastpos. */ + *nfirstpos++ = *nlastpos++ = 1; + --firstpos, --lastpos; + firstpos->index = lastpos->index = i; + firstpos->constraint = lastpos->constraint = _NO_CONSTRAINT; + + /* Allocate the follow set for this position. */ + nalloc[i] = 1; + MALLOC(r->follows[i].elems, _position, nalloc[i]); + break; + } + + /* For each follow set that is the follow set of a real position, replace + it with its epsilon closure. */ + for (i = 0; i < r->tindex; ++i) + if (r->tokens[i] < _NOTCHAR || r->tokens[i] == _BACKREF + || r->tokens[i] >= _SET) + { + copy(&r->follows[i], &merged); + epsclosure(&merged, r); + if (r->follows[i].nelem < merged.nelem) + REALLOC(r->follows[i].elems, _position, merged.nelem); + copy(&merged, &r->follows[i]); + } + + /* Get the epsilon closure of the firstpos of the regexp. The result will + be the set of positions of state 0. */ + merged.nelem = 0; + for (i = 0; i < nfirstpos[-1]; ++i) + insert(firstpos[i], &merged); + epsclosure(&merged, r); + + /* Check if any of the positions of state 0 will want newline context. */ + wants_newline = 0; + for (i = 0; i < merged.nelem; ++i) + if (_PREV_NEWLINE_DEPENDENT(merged.elems[i].constraint)) + wants_newline = 1; + + /* Build the initial state. 
*/ + r->salloc = 1; + r->sindex = 0; + MALLOC(r->states, _dfa_state, r->salloc); + state_index(r, &merged, wants_newline, 0); + + free(o_nullable); + free(o_nfirst); + free(o_firstpos); + free(o_nlast); + free(o_lastpos); + free(nalloc); + free(merged.elems); +} + +/* Find, for each character, the transition out of state s of r, and store + it in the appropriate slot of trans. + + We divide the positions of s into groups (positions can appear in more + than one group). Each group is labeled with a set of characters that + every position in the group matches (taking into account, if necessary, + preceding context information of s). For each group, find the union + of its elements' follows. This set is the set of positions of the + new state. For each character in the group's label, set the transition + on this character to go to a state corresponding to the set's positions, + and its associated backward context information, if necessary. + + If we are building a searching matcher, we include the positions of state + 0 in every state. + + The collection of groups is constructed by building an equivalence-class + partition of the positions of s. + + For each position, find the set of characters C that it matches. Eliminate + any characters from C that fail on grounds of backward context. + + Search through the groups, looking for a group whose label L has nonempty + intersection with C. If L - C is nonempty, create a new group labeled + L - C and having the same positions as the current group, and set L to + the intersection of L and C. Insert the position in this group, set + C = C - L, and resume scanning. + + If after comparing with every group there are characters remaining in C, + create a new group labeled with the characters of C and insert this + position in that group. */ +void +regstate(s, r, trans) + int s; + struct regexp *r; + int trans[]; +{ + _position_set grps[_NOTCHAR]; /* As many as will ever be needed. 
*/ + _charset labels[_NOTCHAR]; /* Labels corresponding to the groups. */ + int ngrps = 0; /* Number of groups actually used. */ + _position pos; /* Current position being considered. */ + _charset matches; /* Set of matching characters. */ + int matchesf; /* True if matches is nonempty. */ + _charset intersect; /* Intersection with some label set. */ + int intersectf; /* True if intersect is nonempty. */ + _charset leftovers; /* Stuff in the label that didn't match. */ + int leftoversf; /* True if leftovers is nonempty. */ + static _charset letters; /* Set of characters considered letters. */ + static _charset newline; /* Set containing only the newline character. */ + _position_set follows; /* Union of the follows of some group. */ + _position_set tmp; /* Temporary space for merging sets. */ + int state; /* New state. */ + int wants_newline; /* New state wants to know newline context. */ + int state_newline; /* New state on a newline transition. */ + int wants_letter; /* New state wants to know letter context. */ + int state_letter; /* New state on a letter transition. */ + static initialized; /* Flag for static initialization. */ + int i, j, k; + + /* Initialize the sets of letters and newline, if necessary. */ + if (! initialized) + { + initialized = 1; + for (i = 0; i < _NOTCHAR; ++i) + if (ISALNUM(i)) + setbit(i, letters); + setbit('\n', newline); + } + + zeroset(matches); + + for (i = 0; i < r->states[s].elems.nelem; ++i) + { + pos = r->states[s].elems.elems[i]; + if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR) + setbit(r->tokens[pos.index], matches); + else if (r->tokens[pos.index] >= _SET) + copyset(r->charsets[r->tokens[pos.index] - _SET], matches); + else + continue; + + /* Some characters may need to be eliminated from matches because + they fail in the current context. */ + if (pos.constraint != 0xff) + { + if (! _MATCHES_NEWLINE_CONTEXT(pos.constraint, + r->states[s].newline, 1)) + clrbit('\n', matches); + if (! 
_MATCHES_NEWLINE_CONTEXT(pos.constraint, + r->states[s].newline, 0)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= newline[j]; + if (! _MATCHES_LETTER_CONTEXT(pos.constraint, + r->states[s].letter, 1)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= ~letters[j]; + if (! _MATCHES_LETTER_CONTEXT(pos.constraint, + r->states[s].letter, 0)) + for (j = 0; j < _CHARSET_INTS; ++j) + matches[j] &= letters[j]; + + /* If there are no characters left, there's no point in going on. */ + for (j = 0; j < _CHARSET_INTS && !matches[j]; ++j) + ; + if (j == _CHARSET_INTS) + continue; + } + + for (j = 0; j < ngrps; ++j) + { + /* If matches contains a single character only, and the current + group's label doesn't contain that character, go on to the + next group. */ + if (r->tokens[pos.index] >= 0 && r->tokens[pos.index] < _NOTCHAR + && !tstbit(r->tokens[pos.index], labels[j])) + continue; + + /* Check if this group's label has a nonempty intersection with + matches. */ + intersectf = 0; + for (k = 0; k < _CHARSET_INTS; ++k) + (intersect[k] = matches[k] & labels[j][k]) ? intersectf = 1 : 0; + if (! intersectf) + continue; + + /* It does; now find the set differences both ways. */ + leftoversf = matchesf = 0; + for (k = 0; k < _CHARSET_INTS; ++k) + { + /* Even an optimizing compiler can't know this for sure. */ + int match = matches[k], label = labels[j][k]; + + (leftovers[k] = ~match & label) ? leftoversf = 1 : 0; + (matches[k] = match & ~label) ? matchesf = 1 : 0; + } + + /* If there were leftovers, create a new group labeled with them. */ + if (leftoversf) + { + copyset(leftovers, labels[ngrps]); + copyset(intersect, labels[j]); + MALLOC(grps[ngrps].elems, _position, r->nleaves); + copy(&grps[j], &grps[ngrps]); + ++ngrps; + } + + /* Put the position in the current group. Note that there is no + reason to call insert() here. */ + grps[j].elems[grps[j].nelem++] = pos; + + /* If every character matching the current position has been + accounted for, we're done. */ + if (! 
matchesf) + break; + } + + /* If we've passed the last group, and there are still characters + unaccounted for, then we'll have to create a new group. */ + if (j == ngrps) + { + copyset(matches, labels[ngrps]); + zeroset(matches); + MALLOC(grps[ngrps].elems, _position, r->nleaves); + grps[ngrps].nelem = 1; + grps[ngrps].elems[0] = pos; + ++ngrps; + } + } + + MALLOC(follows.elems, _position, r->nleaves); + MALLOC(tmp.elems, _position, r->nleaves); + + /* If we are a searching matcher, the default transition is to a state + containing the positions of state 0, otherwise the default transition + is to fail miserably. */ + if (r->searchflag) + { + wants_newline = 0; + wants_letter = 0; + for (i = 0; i < r->states[0].elems.nelem; ++i) + { + if (_PREV_NEWLINE_DEPENDENT(r->states[0].elems.elems[i].constraint)) + wants_newline = 1; + if (_PREV_LETTER_DEPENDENT(r->states[0].elems.elems[i].constraint)) + wants_letter = 1; + } + copy(&r->states[0].elems, &follows); + state = state_index(r, &follows, 0, 0); + if (wants_newline) + state_newline = state_index(r, &follows, 1, 0); + else + state_newline = state; + if (wants_letter) + state_letter = state_index(r, &follows, 0, 1); + else + state_letter = state; + for (i = 0; i < _NOTCHAR; ++i) + trans[i] = (ISALNUM(i)) ? state_letter : state ; + trans['\n'] = state_newline; + } + else + for (i = 0; i < _NOTCHAR; ++i) + trans[i] = -1; + + for (i = 0; i < ngrps; ++i) + { + follows.nelem = 0; + + /* Find the union of the follows of the positions of the group. + This is a hideously inefficient loop. Fix it someday. */ + for (j = 0; j < grps[i].nelem; ++j) + for (k = 0; k < r->follows[grps[i].elems[j].index].nelem; ++k) + insert(r->follows[grps[i].elems[j].index].elems[k], &follows); + + /* If we are building a searching matcher, throw in the positions + of state 0 as well. 
*/ + if (r->searchflag) + for (j = 0; j < r->states[0].elems.nelem; ++j) + insert(r->states[0].elems.elems[j], &follows); + + /* Find out if the new state will want any context information. */ + wants_newline = 0; + if (tstbit('\n', labels[i])) + for (j = 0; j < follows.nelem; ++j) + if (_PREV_NEWLINE_DEPENDENT(follows.elems[j].constraint)) + wants_newline = 1; + + wants_letter = 0; + for (j = 0; j < _CHARSET_INTS; ++j) + if (labels[i][j] & letters[j]) + break; + if (j < _CHARSET_INTS) + for (j = 0; j < follows.nelem; ++j) + if (_PREV_LETTER_DEPENDENT(follows.elems[j].constraint)) + wants_letter = 1; + + /* Find the state(s) corresponding to the union of the follows. */ + state = state_index(r, &follows, 0, 0); + if (wants_newline) + state_newline = state_index(r, &follows, 1, 0); + else + state_newline = state; + if (wants_letter) + state_letter = state_index(r, &follows, 0, 1); + else + state_letter = state; + + /* Set the transitions for each character in the current label. */ + for (j = 0; j < _CHARSET_INTS; ++j) + for (k = 0; k < INTBITS; ++k) + if (labels[i][j] & 1 << k) + { + int c = j * INTBITS + k; + + if (c == '\n') + trans[c] = state_newline; + else if (ISALNUM(c)) + trans[c] = state_letter; + else if (c < _NOTCHAR) + trans[c] = state; + } + } + + for (i = 0; i < ngrps; ++i) + free(grps[i].elems); + free(follows.elems); + free(tmp.elems); +} + +/* Some routines for manipulating a compiled regexp's transition tables. + Each state may or may not have a transition table; if it does, and it + is a non-accepting state, then r->trans[state] points to its table. + If it is an accepting state then r->fails[state] points to its table. + If it has no table at all, then r->trans[state] is NULL. + TODO: Improve this comment, get rid of the unnecessary redundancy. */ + +static void +build_state(s, r) + int s; + struct regexp *r; +{ + int *trans; /* The new transition table. 
*/ + int i; + + /* Set an upper limit on the number of transition tables that will ever + exist at once. 1024 is arbitrary. The idea is that the frequently + used transition tables will be quickly rebuilt, whereas the ones that + were only needed once or twice will be cleared away. */ + if (r->trcount >= 1024) + { + for (i = 0; i < r->tralloc; ++i) + if (r->trans[i]) + { + free((ptr_t) r->trans[i]); + r->trans[i] = NULL; + } + else if (r->fails[i]) + { + free((ptr_t) r->fails[i]); + r->fails[i] = NULL; + } + r->trcount = 0; + } + + ++r->trcount; + + /* Set up the success bits for this state. */ + r->success[s] = 0; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 1, r->states[s].letter, 0, + s, *r)) + r->success[s] |= 4; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 1, + s, *r)) + r->success[s] |= 2; + if (ACCEPTS_IN_CONTEXT(r->states[s].newline, 0, r->states[s].letter, 0, + s, *r)) + r->success[s] |= 1; + + MALLOC(trans, int, _NOTCHAR); + regstate(s, r, trans); + + /* Now go through the new transition table, and make sure that the trans + and fail arrays are allocated large enough to hold a pointer for the + largest state mentioned in the table. */ + for (i = 0; i < _NOTCHAR; ++i) + if (trans[i] >= r->tralloc) + { + int oldalloc = r->tralloc; + + while (trans[i] >= r->tralloc) + r->tralloc *= 2; + REALLOC(r->realtrans, int *, r->tralloc + 1); + r->trans = r->realtrans + 1; + REALLOC(r->fails, int *, r->tralloc); + REALLOC(r->success, int, r->tralloc); + REALLOC(r->newlines, int, r->tralloc); + while (oldalloc < r->tralloc) + { + r->trans[oldalloc] = NULL; + r->fails[oldalloc++] = NULL; + } + } + + /* Keep the newline transition in a special place so we can use it as + a sentinel. 
*/ + r->newlines[s] = trans['\n']; + trans['\n'] = -1; + + if (ACCEPTING(s, *r)) + r->fails[s] = trans; + else + r->trans[s] = trans; +} + +static void +build_state_zero(r) + struct regexp *r; +{ + r->tralloc = 1; + r->trcount = 0; + CALLOC(r->realtrans, int *, r->tralloc + 1); + r->trans = r->realtrans + 1; + CALLOC(r->fails, int *, r->tralloc); + MALLOC(r->success, int, r->tralloc); + MALLOC(r->newlines, int, r->tralloc); + build_state(0, r); +} + +/* Search through a buffer looking for a match to the given struct regexp. + Find the first occurrence of a string matching the regexp in the buffer, + and the shortest possible version thereof. Return a pointer to the first + character after the match, or NULL if none is found. Begin points to + the beginning of the buffer, and end points to the first character after + its end. We store a newline in *end to act as a sentinel, so end had + better point somewhere valid. Newline is a flag indicating whether to + allow newlines to be in the matching string. If count is non- + NULL it points to a place we're supposed to increment every time we + see a newline. Finally, if backref is non-NULL it points to a place + where we're supposed to store a 1 if backreferencing happened and the + match needs to be verified by a backtracking matcher. Otherwise + we store a 0 in *backref. */ +char * +regexecute(r, begin, end, newline, count, backref) + struct regexp *r; + char *begin; + char *end; + int newline; + int *count; + int *backref; +{ + register s, s1, tmp; /* Current state. */ + register unsigned char *p; /* Current input character. */ + register **trans, *t; /* Copy of r->trans so it can be optimized + into a register. */ + static sbit[_NOTCHAR]; /* Table for anding with r->success. */ + static sbit_init; + + if (! sbit_init) + { + int i; + + sbit_init = 1; + for (i = 0; i < _NOTCHAR; ++i) + sbit[i] = (ISALNUM(i)) ? 2 : 1; + sbit['\n'] = 4; + } + + if (! 
r->tralloc) + build_state_zero(r); + + s = s1 = 0; + p = (unsigned char *) begin; + trans = r->trans; + *end = '\n'; + + for (;;) + { + while ((t = trans[s]) != 0) { /* hand-optimized loop */ + s1 = t[*p++]; + if ((t = trans[s1]) == 0) { + tmp = s ; s = s1 ; s1 = tmp ; /* swap */ + break; + } + s = t[*p++]; + } + + if (s >= 0 && p <= (unsigned char *) end && r->fails[s]) + { + if (r->success[s] & sbit[*p]) + { + if (backref) + *backref = (r->states[s].backref != 0); + return (char *) p; + } + + s1 = s; + s = r->fails[s][*p++]; + continue; + } + + /* If the previous character was a newline, count it. */ + if (count && (char *) p <= end && p[-1] == '\n') + ++*count; + + /* Check if we've run off the end of the buffer. */ + if ((char *) p >= end) + return NULL; + + if (s >= 0) + { + build_state(s, r); + trans = r->trans; + continue; + } + + if (p[-1] == '\n' && newline) + { + s = r->newlines[s1]; + continue; + } + + s = 0; + } +} + +/* Initialize the components of a regexp that the other routines don't + initialize for themselves. */ +void +reginit(r) + struct regexp *r; +{ + r->calloc = 1; + MALLOC(r->charsets, _charset, r->calloc); + r->cindex = 0; + + r->talloc = 1; + MALLOC(r->tokens, _token, r->talloc); + r->tindex = r->depth = r->nleaves = r->nregexps = 0; + + r->searchflag = 0; + r->tralloc = 0; +} + +/* Parse and analyze a single string of the given length. */ +void +regcompile(s, len, r, searchflag) + const char *s; + size_t len; + struct regexp *r; + int searchflag; +{ + if (case_fold) /* dummy folding in service of regmust() */ + { + char *regcopy; + int i; + + regcopy = malloc(len); + if (!regcopy) + reg_error("out of memory"); + + /* This is a complete kludge and could potentially break + \<letter> escapes . . . 
*/ + case_fold = 0; + for (i = 0; i < len; ++i) + if (ISUPPER(s[i])) + regcopy[i] = tolower(s[i]); + else + regcopy[i] = s[i]; + + reginit(r); + r->mustn = 0; + r->must[0] = '\0'; + regparse(regcopy, len, r); + free(regcopy); + regmust(r); + reganalyze(r, searchflag); + case_fold = 1; + reginit(r); + regparse(s, len, r); + reganalyze(r, searchflag); + } + else + { + reginit(r); + regparse(s, len, r); + regmust(r); + reganalyze(r, searchflag); + } +} + +/* Free the storage held by the components of a regexp. */ +void +reg_free(r) + struct regexp *r; +{ + int i; + + free((ptr_t) r->charsets); + free((ptr_t) r->tokens); + for (i = 0; i < r->sindex; ++i) + free((ptr_t) r->states[i].elems.elems); + free((ptr_t) r->states); + for (i = 0; i < r->tindex; ++i) + if (r->follows[i].elems) + free((ptr_t) r->follows[i].elems); + free((ptr_t) r->follows); + for (i = 0; i < r->tralloc; ++i) + if (r->trans[i]) + free((ptr_t) r->trans[i]); + else if (r->fails[i]) + free((ptr_t) r->fails[i]); + if (r->realtrans) + free((ptr_t) r->realtrans); + if (r->fails) + free((ptr_t) r->fails); + if (r->newlines) + free((ptr_t) r->newlines); +} + +/* +Having found the postfix representation of the regular expression, +try to find a long sequence of characters that must appear in any line +containing the r.e. +Finding a "longest" sequence is beyond the scope here; +we take an easy way out and hope for the best. +(Take "(ab|a)b"--please.) + +We do a bottom-up calculation of sequences of characters that must appear +in matches of r.e.'s represented by trees rooted at the nodes of the postfix +representation: + sequences that must appear at the left of the match ("left") + sequences that must appear at the right of the match ("right") + lists of sequences that must appear somewhere in the match ("in") + sequences that must constitute the match ("is") +When we get to the root of the tree, we use one of the longest of its +calculated "in" sequences as our answer. 
The sequence we find is returned in +r->must (where "r" is the single argument passed to "regmust"); +the length of the sequence is returned in r->mustn. + +The sequences calculated for the various types of node (in pseudo ANSI c) +are shown below. "p" is the operand of unary operators (and the left-hand +operand of binary operators); "q" is the right-hand operand of binary operators +. +"ZERO" means "a zero-length sequence" below. + +Type left right is in +---- ---- ----- -- -- +char c # c # c # c # c + +SET ZERO ZERO ZERO ZERO + +STAR ZERO ZERO ZERO ZERO + +QMARK ZERO ZERO ZERO ZERO + +PLUS p->left p->right ZERO p->in + +CAT (p->is==ZERO)? (q->is==ZERO)? (p->is!=ZERO && p->in plus + p->left : q->right : q->is!=ZERO) ? q->in plus + p->is##q->left p->right##q->is p->is##q->is : p->right##q->left + ZERO + +OR longest common longest common (do p->is and substrings common to + leading trailing q->is have same p->in and q->in + (sub)sequence (sub)sequence length and + of p->left of p->right content) ? + and q->left and q->right p->is : NULL + +If there's anything else we recognize in the tree, all four sequences get set +to zero-length sequences. If there's something we don't recognize in the tree, +we just return a zero-length sequence. + +Break ties in favor of infrequent letters (choosing 'zzz' in preference to +'aaa')? + +And. . .is it here or someplace that we might ponder "optimizations" such as + egrep 'psi|epsilon' -> egrep 'psi' + egrep 'pepsi|epsilon' -> egrep 'epsi' + (Yes, we now find "epsi" as a "string + that must occur", but we might also + simplify the *entire* r.e. being sought +) + grep '[c]' -> grep 'c' + grep '(ab|a)b' -> grep 'ab' + grep 'ab*' -> grep 'a' + grep 'a*b' -> grep 'b' +There are several issues: + Is optimization easy (enough)? + + Does optimization actually accomplish anything, + or is the automaton you get from "psi|epsilon" (for example) + the same as the one you get from "psi" (for example)? 
+ + Are optimizable r.e.'s likely to be used in real-life situations + (something like 'ab*' is probably unlikely; something like + 'psi|epsilon' is likelier)? +*/ + +static char * +icatalloc(old, new) +char * old; +const char * new; +{ + register char * result; + register int oldsize, newsize; + + newsize = (new == NULL) ? 0 : strlen(new); + if (old == NULL) + oldsize = 0; + else if (newsize == 0) + return old; + else oldsize = strlen(old); + if (old == NULL) + result = (char *) malloc(newsize + 1); + else result = (char *) realloc((void *) old, oldsize + newsize + 1); + if (result != NULL && new != NULL) + (void) strcpy(result + oldsize, new); + return result; +} + +static char * +icpyalloc(string) +const char * string; +{ + return icatalloc((char *) NULL, string); +} + +static char * +istrstr(lookin, lookfor) +char * lookin; +register char * lookfor; +{ + register char * cp; + register int len; + + len = strlen(lookfor); + for (cp = lookin; *cp != '\0'; ++cp) + if (strncmp(cp, lookfor, len) == 0) + return cp; + return NULL; +} + +static void +ifree(cp) +char * cp; +{ + if (cp != NULL) + free(cp); +} + +static void +freelist(cpp) +register char ** cpp; +{ + register int i; + + if (cpp == NULL) + return; + for (i = 0; cpp[i] != NULL; ++i) { + free(cpp[i]); + cpp[i] = NULL; + } +} + +static char ** +enlist(cpp, new, len) +register char ** cpp; +register char * new; +#ifdef __STDC__ +size_t len; +#else +int len; +#endif +{ + register int i, j; + + if (cpp == NULL) + return NULL; + if ((new = icpyalloc(new)) == NULL) { + freelist(cpp); + return NULL; + } + new[len] = '\0'; + /* + ** Is there already something in the list that's new (or longer)? + */ + for (i = 0; cpp[i] != NULL; ++i) + if (istrstr(cpp[i], new) != NULL) { + free(new); + return cpp; + } + /* + ** Eliminate any obsoleted strings. 
+ */ + j = 0; + while (cpp[j] != NULL) + if (istrstr(new, cpp[j]) == NULL) + ++j; + else { + free(cpp[j]); + if (--i == j) + break; + cpp[j] = cpp[i]; + } + /* + ** Add the new string. + */ + cpp = (char **) realloc((char *) cpp, (i + 2) * sizeof *cpp); + if (cpp == NULL) + return NULL; + cpp[i] = new; + cpp[i + 1] = NULL; + return cpp; +} + +/* +** Given pointers to two strings, +** return a pointer to an allocated list of their distinct common substrings. +** Return NULL if something seems wild. +*/ + +static char ** +comsubs(left, right) +char * left; +char * right; +{ + register char ** cpp; + register char * lcp; + register char * rcp; + register int i, len; + + if (left == NULL || right == NULL) + return NULL; + cpp = (char **) malloc(sizeof *cpp); + if (cpp == NULL) + return NULL; + cpp[0] = NULL; + for (lcp = left; *lcp != '\0'; ++lcp) { + len = 0; + rcp = strchr(right, *lcp); + while (rcp != NULL) { + for (i = 1; lcp[i] != '\0' && lcp[i] == rcp[i]; ++i) + ; + if (i > len) + len = i; + rcp = strchr(rcp + 1, *lcp); + } + if (len == 0) + continue; +#ifdef __STDC__ + if ((cpp = enlist(cpp, lcp, (size_t)len)) == NULL) +#else + if ((cpp = enlist(cpp, lcp, len)) == NULL) +#endif + break; + } + return cpp; +} + +static char ** +addlists(old, new) +char ** old; +char ** new; +{ + register int i; + + if (old == NULL || new == NULL) + return NULL; + for (i = 0; new[i] != NULL; ++i) { + old = enlist(old, new[i], strlen(new[i])); + if (old == NULL) + break; + } + return old; +} + +/* +** Given two lists of substrings, +** return a new list giving substrings common to both. 
+*/ + +static char ** +inboth(left, right) +char ** left; +char ** right; +{ + register char ** both; + register char ** temp; + register int lnum, rnum; + + if (left == NULL || right == NULL) + return NULL; + both = (char **) malloc(sizeof *both); + if (both == NULL) + return NULL; + both[0] = NULL; + for (lnum = 0; left[lnum] != NULL; ++lnum) { + for (rnum = 0; right[rnum] != NULL; ++rnum) { + temp = comsubs(left[lnum], right[rnum]); + if (temp == NULL) { + freelist(both); + return NULL; + } + both = addlists(both, temp); + freelist(temp); + if (both == NULL) + return NULL; + } + } + return both; +} + +/* +typedef struct { + char ** in; + char * left; + char * right; + char * is; +} must; + */ +static void +resetmust(mp) +register must * mp; +{ + mp->left[0] = mp->right[0] = mp->is[0] = '\0'; + freelist(mp->in); +} + +static void +regmust(r) +register struct regexp * r; +{ + register must * musts; + register must * mp; + register char * result = ""; + register int ri; + register int i; + register _token t; + static must must0; + + reg->mustn = 0; + reg->must[0] = '\0'; + musts = (must *) malloc((reg->tindex + 1) * sizeof *musts); + if (musts == NULL) + return; + mp = musts; + for (i = 0; i <= reg->tindex; ++i) + mp[i] = must0; + for (i = 0; i <= reg->tindex; ++i) { + mp[i].in = (char **) malloc(sizeof *mp[i].in); + mp[i].left = malloc(2); + mp[i].right = malloc(2); + mp[i].is = malloc(2); + if (mp[i].in == NULL || mp[i].left == NULL || + mp[i].right == NULL || mp[i].is == NULL) + goto done; + mp[i].left[0] = mp[i].right[0] = mp[i].is[0] = '\0'; + mp[i].in[0] = NULL; + } + for (ri = 0; ri < reg->tindex; ++ri) { + switch (t = reg->tokens[ri]) { + case _ALLBEGLINE: + case _ALLENDLINE: + case _LPAREN: + case _RPAREN: + goto done; /* "cannot happen" */ + case _EMPTY: + case _BEGLINE: + case _ENDLINE: + case _BEGWORD: + case _ENDWORD: + case _LIMWORD: + case _NOTLIMWORD: + case _BACKREF: + resetmust(mp); + break; + case _STAR: + case _QMARK: + if (mp <= musts) + goto 
done; /* "cannot happen" */ + --mp; + resetmust(mp); + break; + case _OR: + if (mp < &musts[2]) + goto done; /* "cannot happen" */ + { + register char ** new; + register must * lmp; + register must * rmp; + register int j, ln, rn, n; + + rmp = --mp; + lmp = --mp; + /* Guaranteed to be. Unlikely, but. . . */ + if (strcmp(lmp->is, rmp->is) != 0) + lmp->is[0] = '\0'; + /* Left side--easy */ + i = 0; + while (lmp->left[i] != '\0' && + lmp->left[i] == rmp->left[i]) + ++i; + lmp->left[i] = '\0'; + /* Right side */ + ln = strlen(lmp->right); + rn = strlen(rmp->right); + n = ln; + if (n > rn) + n = rn; + for (i = 0; i < n; ++i) + if (lmp->right[ln - i - 1] != + rmp->right[rn - i - 1]) + break; + for (j = 0; j < i; ++j) + lmp->right[j] = + lmp->right[(ln - i) + j]; + lmp->right[j] = '\0'; + new = inboth(lmp->in, rmp->in); + if (new == NULL) + goto done; + freelist(lmp->in); + free((char *) lmp->in); + lmp->in = new; + } + break; + case _PLUS: + if (mp <= musts) + goto done; /* "cannot happen" */ + --mp; + mp->is[0] = '\0'; + break; + case _END: + if (mp != &musts[1]) + goto done; /* "cannot happen" */ + for (i = 0; musts[0].in[i] != NULL; ++i) + if (strlen(musts[0].in[i]) > strlen(result)) + result = musts[0].in[i]; + goto done; + case _CAT: + if (mp < &musts[2]) + goto done; /* "cannot happen" */ + { + register must * lmp; + register must * rmp; + + rmp = --mp; + lmp = --mp; + /* + ** In. Everything in left, plus everything in + ** right, plus catenation of + ** left's right and right's left. 
+ */ + lmp->in = addlists(lmp->in, rmp->in); + if (lmp->in == NULL) + goto done; + if (lmp->right[0] != '\0' && + rmp->left[0] != '\0') { + register char * tp; + + tp = icpyalloc(lmp->right); + if (tp == NULL) + goto done; + tp = icatalloc(tp, rmp->left); + if (tp == NULL) + goto done; + lmp->in = enlist(lmp->in, tp, + strlen(tp)); + free(tp); + if (lmp->in == NULL) + goto done; + } + /* Left-hand */ + if (lmp->is[0] != '\0') { + lmp->left = icatalloc(lmp->left, + rmp->left); + if (lmp->left == NULL) + goto done; + } + /* Right-hand */ + if (rmp->is[0] == '\0') + lmp->right[0] = '\0'; + lmp->right = icatalloc(lmp->right, rmp->right); + if (lmp->right == NULL) + goto done; + /* Guaranteed to be */ + if (lmp->is[0] != '\0' && rmp->is[0] != '\0') { + lmp->is = icatalloc(lmp->is, rmp->is); + if (lmp->is == NULL) + goto done; + } + } + break; + default: + if (t < _END) { + /* "cannot happen" */ + goto done; + } else if (t == '\0') { + /* not on *my* shift */ + goto done; + } else if (t >= _SET) { + /* easy enough */ + resetmust(mp); + } else { + /* plain character */ + resetmust(mp); + mp->is[0] = mp->left[0] = mp->right[0] = t; + mp->is[1] = mp->left[1] = mp->right[1] = '\0'; + mp->in = enlist(mp->in, mp->is, 1); + if (mp->in == NULL) + goto done; + } + break; + } + ++mp; + } +done: + (void) strncpy(reg->must, result, MUST_MAX - 1); + reg->must[MUST_MAX - 1] = '\0'; + reg->mustn = strlen(reg->must); + mp = musts; + for (i = 0; i <= reg->tindex; ++i) { + freelist(mp[i].in); + ifree((char *) mp[i].in); + ifree(mp[i].left); + ifree(mp[i].right); + ifree(mp[i].is); + } + free((char *) mp); +} diff --git a/gnu/usr.bin/awk/dfa.h b/gnu/usr.bin/awk/dfa.h new file mode 100644 index 000000000000..65fc49565a7c --- /dev/null +++ b/gnu/usr.bin/awk/dfa.h @@ -0,0 +1,543 @@ +/* dfa.h - declarations for GNU deterministic regexp compiler + Copyright (C) 1988 Free Software Foundation, Inc. 
+ Written June, 1988 by Mike Haertel + + NO WARRANTY + + BECAUSE THIS PROGRAM IS LICENSED FREE OF CHARGE, WE PROVIDE ABSOLUTELY +NO WARRANTY, TO THE EXTENT PERMITTED BY APPLICABLE STATE LAW. EXCEPT +WHEN OTHERWISE STATED IN WRITING, FREE SOFTWARE FOUNDATION, INC, +RICHARD M. STALLMAN AND/OR OTHER PARTIES PROVIDE THIS PROGRAM "AS IS" +WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, +BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND +FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY +AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE +DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR +CORRECTION. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW WILL RICHARD M. +STALLMAN, THE FREE SOFTWARE FOUNDATION, INC., AND/OR ANY OTHER PARTY +WHO MAY MODIFY AND REDISTRIBUTE THIS PROGRAM AS PERMITTED BELOW, BE +LIABLE TO YOU FOR DAMAGES, INCLUDING ANY LOST PROFITS, LOST MONIES, OR +OTHER SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR +DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY THIRD PARTIES OR +A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS) THIS +PROGRAM, EVEN IF YOU HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH +DAMAGES, OR FOR ANY CLAIM BY ANY OTHER PARTY. + + GENERAL PUBLIC LICENSE TO COPY + + 1. You may copy and distribute verbatim copies of this source file +as you receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy a valid copyright notice "Copyright + (C) 1988 Free Software Foundation, Inc."; and include following the +copyright notice a verbatim copy of the above disclaimer of warranty +and of this License. You may charge a distribution fee for the +physical act of transferring a copy. + + 2. 
You may modify your copy or copies of this source file or +any portion of it, and copy and distribute such modifications under +the terms of Paragraph 1 above, provided that you also do the following: + + a) cause the modified files to carry prominent notices stating + that you changed the files and the date of any change; and + + b) cause the whole of any work that you distribute or publish, + that in whole or in part contains or is a derivative of this + program or any part thereof, to be licensed at no charge to all + third parties on terms identical to those contained in this + License Agreement (except that you may choose to grant more extensive + warranty protection to some or all third parties, at your option). + + c) You may charge a distribution fee for the physical act of + transferring a copy, and you may at your option offer warranty + protection in exchange for a fee. + +Mere aggregation of another unrelated program with this program (or its +derivative) on a volume of a storage or distribution medium does not bring +the other program under the scope of these terms. + + 3. You may copy and distribute this program or any portion of it in +compiled, executable or object code form under the terms of Paragraphs +1 and 2 above provided that you do the following: + + a) accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of + Paragraphs 1 and 2 above; or, + + b) accompany it with a written offer, valid for at least three + years, to give any third party free (except for a nominal + shipping charge) a complete machine-readable copy of the + corresponding source code, to be distributed under the terms of + Paragraphs 1 and 2 above; or, + + c) accompany it with the information you received as to where the + corresponding source code may be obtained. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form alone.) 
+ +For an executable file, complete source code means all the source code for +all modules it contains; but, as a special exception, it need not include +source code for modules which are standard libraries that accompany the +operating system on which the executable file runs. + + 4. You may not copy, sublicense, distribute or transfer this program +except as expressly provided under this License Agreement. Any attempt +otherwise to copy, sublicense, distribute or transfer this program is void and +your rights to use the program under this License agreement shall be +automatically terminated. However, parties who have received computer +software programs from you with this License Agreement will not have +their licenses terminated so long as such parties remain in full compliance. + + 5. If you wish to incorporate parts of this program into other free +programs whose distribution conditions are different, write to the Free +Software Foundation at 675 Mass Ave, Cambridge, MA 02139. We have not yet +worked out a simple rule that can be stated here, but we will often permit +this. We will be guided by the two goals of preserving the free status of +all derivatives our free software and of promoting the sharing and reuse of +software. + + +In other words, you are welcome to use, share and improve this program. +You are forbidden to forbid anyone else to use, share and improve +what you give them. Help stamp out software-hoarding! */ + +#ifdef __STDC__ + +#ifdef SOMEDAY +#define ISALNUM(c) isalnum(c) +#define ISALPHA(c) isalpha(c) +#define ISUPPER(c) isupper(c) +#else +#define ISALNUM(c) (isascii(c) && isalnum(c)) +#define ISALPHA(c) (isascii(c) && isalpha(c)) +#define ISUPPER(c) (isascii(c) && isupper(c)) +#endif + +#else /* ! __STDC__ */ + +#define const + +#define ISALNUM(c) (isascii(c) && isalnum(c)) +#define ISALPHA(c) (isascii(c) && isalpha(c)) +#define ISUPPER(c) (isascii(c) && isupper(c)) + +#endif /* ! 
__STDC__ */ + +/* 1 means plain parentheses serve as grouping, and backslash + parentheses are needed for literal searching. + 0 means backslash-parentheses are grouping, and plain parentheses + are for literal searching. */ +#define RE_NO_BK_PARENS 1L + +/* 1 means plain | serves as the "or"-operator, and \| is a literal. + 0 means \| serves as the "or"-operator, and | is a literal. */ +#define RE_NO_BK_VBAR (1L << 1) + +/* 0 means plain + or ? serves as an operator, and \+, \? are literals. + 1 means \+, \? are operators and plain +, ? are literals. */ +#define RE_BK_PLUS_QM (1L << 2) + +/* 1 means | binds tighter than ^ or $. + 0 means the contrary. */ +#define RE_TIGHT_VBAR (1L << 3) + +/* 1 means treat \n as an _OR operator + 0 means treat it as a normal character */ +#define RE_NEWLINE_OR (1L << 4) + +/* 0 means that special characters (such as *, ^, and $) always have + their special meaning regardless of the surrounding context. + 1 means that special characters may act as normal characters in some + contexts. Specifically, this applies to: + ^ - only special at the beginning, or after ( or | + $ - only special at the end, or before ) or | + *, +, ? - only special when not after the beginning, (, or | */ +#define RE_CONTEXT_INDEP_OPS (1L << 5) + +/* 1 means that \ in a character class escapes the next character (typically + a hyphen). It also is overloaded to mean that hyphen at the end of the range + is allowable and means that the hyphen is to be taken literally. */ +#define RE_AWK_CLASS_HACK (1L << 6) + +/* Now define combinations of bits for the standard possibilities. */ +#ifdef notdef +#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_CONTEXT_INDEP_OPS) +#define RE_SYNTAX_EGREP (RE_SYNTAX_AWK | RE_NEWLINE_OR) +#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR) +#define RE_SYNTAX_EMACS 0 +#endif + +/* The NULL pointer. */ +#ifndef NULL +#define NULL 0 +#endif + +/* Number of bits in an unsigned char. 
*/ +#ifndef CHARBITS +#define CHARBITS 8 +#endif + +/* First integer value that is greater than any character code. */ +#define _NOTCHAR (1 << CHARBITS) + +/* INTBITS need not be exact, just a lower bound. */ +#ifndef INTBITS +#define INTBITS (CHARBITS * sizeof (int)) +#endif + +/* Number of ints required to hold a bit for every character. */ +#define _CHARSET_INTS ((_NOTCHAR + INTBITS - 1) / INTBITS) + +/* Sets of unsigned characters are stored as bit vectors in arrays of ints. */ +typedef int _charset[_CHARSET_INTS]; + +/* The regexp is parsed into an array of tokens in postfix form. Some tokens + are operators and others are terminal symbols. Most (but not all) of these + codes are returned by the lexical analyzer. */ +#ifdef __STDC__ + +typedef enum +{ + _END = -1, /* _END is a terminal symbol that matches the + end of input; any value of _END or less in + the parse tree is such a symbol. Accepting + states of the DFA are those that would have + a transition on _END. */ + + /* Ordinary character values are terminal symbols that match themselves. */ + + _EMPTY = _NOTCHAR, /* _EMPTY is a terminal symbol that matches + the empty string. */ + + _BACKREF, /* _BACKREF is generated by \<digit>; it + is not completely handled. If the scanner + detects a transition on backref, it returns + a kind of "semi-success" indicating that + the match will have to be verified with + a backtracking matcher. */ + + _BEGLINE, /* _BEGLINE is a terminal symbol that matches + the empty string if it is at the beginning + of a line. */ + + _ALLBEGLINE, /* _ALLBEGLINE is a terminal symbol that + matches the empty string if it is at the + beginning of a line; _ALLBEGLINE applies + to the entire regexp and can only occur + as the first token thereof. _ALLBEGLINE + never appears in the parse tree; a _BEGLINE + is prepended with _CAT to the entire + regexp instead. */ + + _ENDLINE, /* _ENDLINE is a terminal symbol that matches + the empty string if it is at the end of + a line. 
*/ + + _ALLENDLINE, /* _ALLENDLINE is to _ENDLINE as _ALLBEGLINE + is to _BEGLINE. */ + + _BEGWORD, /* _BEGWORD is a terminal symbol that matches + the empty string if it is at the beginning + of a word. */ + + _ENDWORD, /* _ENDWORD is a terminal symbol that matches + the empty string if it is at the end of + a word. */ + + _LIMWORD, /* _LIMWORD is a terminal symbol that matches + the empty string if it is at the beginning + or the end of a word. */ + + _NOTLIMWORD, /* _NOTLIMWORD is a terminal symbol that + matches the empty string if it is not at + the beginning or end of a word. */ + + _QMARK, /* _QMARK is an operator of one argument that + matches zero or one occurrences of its + argument. */ + + _STAR, /* _STAR is an operator of one argument that + matches the Kleene closure (zero or more + occurrences) of its argument. */ + + _PLUS, /* _PLUS is an operator of one argument that + matches the positive closure (one or more + occurrences) of its argument. */ + + _CAT, /* _CAT is an operator of two arguments that + matches the concatenation of its + arguments. _CAT is never returned by the + lexical analyzer. */ + + _OR, /* _OR is an operator of two arguments that + matches either of its arguments. */ + + _LPAREN, /* _LPAREN never appears in the parse tree, + it is only a lexeme. */ + + _RPAREN, /* _RPAREN never appears in the parse tree. */ + + _SET /* _SET (and any value greater) is a + terminal symbol that matches any of a + class of characters. */ +} _token; + +#else /* ! 
__STDC__ */ + +typedef short _token; + +#define _END -1 +#define _EMPTY _NOTCHAR +#define _BACKREF (_EMPTY + 1) +#define _BEGLINE (_EMPTY + 2) +#define _ALLBEGLINE (_EMPTY + 3) +#define _ENDLINE (_EMPTY + 4) +#define _ALLENDLINE (_EMPTY + 5) +#define _BEGWORD (_EMPTY + 6) +#define _ENDWORD (_EMPTY + 7) +#define _LIMWORD (_EMPTY + 8) +#define _NOTLIMWORD (_EMPTY + 9) +#define _QMARK (_EMPTY + 10) +#define _STAR (_EMPTY + 11) +#define _PLUS (_EMPTY + 12) +#define _CAT (_EMPTY + 13) +#define _OR (_EMPTY + 14) +#define _LPAREN (_EMPTY + 15) +#define _RPAREN (_EMPTY + 16) +#define _SET (_EMPTY + 17) + +#endif /* ! __STDC__ */ + +/* Sets are stored in an array in the compiled regexp; the index of the + array corresponding to a given set token is given by _SET_INDEX(t). */ +#define _SET_INDEX(t) ((t) - _SET) + +/* Sometimes characters can only be matched depending on the surrounding + context. Such context decisions depend on what the previous character + was, and the value of the current (lookahead) character. Context + dependent constraints are encoded as 8 bit integers. Each bit that + is set indicates that the constraint succeeds in the corresponding + context. + + bit 7 - previous and current are newlines + bit 6 - previous was newline, current isn't + bit 5 - previous wasn't newline, current is + bit 4 - neither previous nor current is a newline + bit 3 - previous and current are word-constituents + bit 2 - previous was word-constituent, current isn't + bit 1 - previous wasn't word-constituent, current is + bit 0 - neither previous nor current is word-constituent + + Word-constituent characters are those that satisfy isalnum(). + + The macro _SUCCEEDS_IN_CONTEXT determines whether a given constraint + succeeds in a particular context. Prevn is true if the previous character + was a newline, currn is true if the lookahead character is a newline. + Prevl and currl similarly depend upon whether the previous and current + characters are word-constituent letters. 
*/ +#define _MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \ + ((constraint) & (1 << (((prevn) ? 2 : 0) + ((currn) ? 1 : 0) + 4))) +#define _MATCHES_LETTER_CONTEXT(constraint, prevl, currl) \ + ((constraint) & (1 << (((prevl) ? 2 : 0) + ((currl) ? 1 : 0)))) +#define _SUCCEEDS_IN_CONTEXT(constraint, prevn, currn, prevl, currl) \ + (_MATCHES_NEWLINE_CONTEXT(constraint, prevn, currn) \ + && _MATCHES_LETTER_CONTEXT(constraint, prevl, currl)) + +/* The following macros give information about what a constraint depends on. */ +#define _PREV_NEWLINE_DEPENDENT(constraint) \ + (((constraint) & 0xc0) >> 2 != ((constraint) & 0x30)) +#define _PREV_LETTER_DEPENDENT(constraint) \ + (((constraint) & 0x0c) >> 2 != ((constraint) & 0x03)) + +/* Tokens that match the empty string subject to some constraint actually + work by applying that constraint to determine what may follow them, + taking into account what has gone before. The following values are + the constraints corresponding to the special tokens previously defined. */ +#define _NO_CONSTRAINT 0xff +#define _BEGLINE_CONSTRAINT 0xcf +#define _ENDLINE_CONSTRAINT 0xaf +#define _BEGWORD_CONSTRAINT 0xf2 +#define _ENDWORD_CONSTRAINT 0xf4 +#define _LIMWORD_CONSTRAINT 0xf6 +#define _NOTLIMWORD_CONSTRAINT 0xf9 + +/* States of the recognizer correspond to sets of positions in the parse + tree, together with the constraints under which they may be matched. + So a position is encoded as an index into the parse tree together with + a constraint. */ +typedef struct +{ + unsigned index; /* Index into the parse array. */ + unsigned constraint; /* Constraint for matching this position. */ +} _position; + +/* Sets of positions are stored as arrays. */ +typedef struct +{ + _position *elems; /* Elements of this position set. */ + int nelem; /* Number of elements in this set. 
*/ +} _position_set; + +/* A state of the regexp consists of a set of positions, some flags, + and the token value of the lowest-numbered position of the state that + contains an _END token. */ +typedef struct +{ + int hash; /* Hash of the positions of this state. */ + _position_set elems; /* Positions this state could match. */ + char newline; /* True if previous state matched newline. */ + char letter; /* True if previous state matched a letter. */ + char backref; /* True if this state matches a \<digit>. */ + unsigned char constraint; /* Constraint for this state to accept. */ + int first_end; /* Token value of the first _END in elems. */ +} _dfa_state; + +/* If an r.e. is at most MUST_MAX characters long, we look for a string which + must appear in it; whatever's found is dropped into the struct reg. */ + +#define MUST_MAX 50 + +/* A compiled regular expression. */ +struct regexp +{ + /* Stuff built by the scanner. */ + _charset *charsets; /* Array of character sets for _SET tokens. */ + int cindex; /* Index for adding new charsets. */ + int calloc; /* Number of charsets currently allocated. */ + + /* Stuff built by the parser. */ + _token *tokens; /* Postfix parse array. */ + int tindex; /* Index for adding new tokens. */ + int talloc; /* Number of tokens currently allocated. */ + int depth; /* Depth required of an evaluation stack + used for depth-first traversal of the + parse tree. */ + int nleaves; /* Number of leaves on the parse tree. */ + int nregexps; /* Count of parallel regexps being built + with regparse(). */ + + /* Stuff owned by the state builder. */ + _dfa_state *states; /* States of the regexp. */ + int sindex; /* Index for adding new states. */ + int salloc; /* Number of states currently allocated. */ + + /* Stuff built by the structure analyzer. */ + _position_set *follows; /* Array of follow sets, indexed by position + index. 
The follow of a position is the set + of positions containing characters that + could conceivably follow a character + matching the given position in a string + matching the regexp. Allocated to the + maximum possible position index. */ + int searchflag; /* True if we are supposed to build a searching + as opposed to an exact matcher. A searching + matcher finds the first and shortest string + matching a regexp anywhere in the buffer, + whereas an exact matcher finds the longest + string matching, but anchored to the + beginning of the buffer. */ + + /* Stuff owned by the executor. */ + int tralloc; /* Number of transition tables that have + slots so far. */ + int trcount; /* Number of transition tables that have + actually been built. */ + int **trans; /* Transition tables for states that can + never accept. If the transitions for a + state have not yet been computed, or the + state could possibly accept, its entry in + this table is NULL. */ + int **realtrans; /* Trans always points to realtrans + 1; this + is so trans[-1] can contain NULL. */ + int **fails; /* Transition tables after failing to accept + on a state that potentially could do so. */ + int *success; /* Table of acceptance conditions used in + regexecute and computed in build_state. */ + int *newlines; /* Transitions on newlines. The entry for a + newline in any transition table is always + -1 so we can count lines without wasting + too many cycles. The transition for a + newline is stored separately and handled + as a special case. Newline is also used + as a sentinel at the end of the buffer. */ + char must[MUST_MAX]; + int mustn; +}; + +/* Some macros for user access to regexp internals. */ + +/* ACCEPTING returns true if s could possibly be an accepting state of r. */ +#define ACCEPTING(s, r) ((r).states[s].constraint) + +/* ACCEPTS_IN_CONTEXT returns true if the given state accepts in the + specified context. 
*/ +#define ACCEPTS_IN_CONTEXT(prevn, currn, prevl, currl, state, reg) \ + _SUCCEEDS_IN_CONTEXT((reg).states[state].constraint, \ + prevn, currn, prevl, currl) + +/* FIRST_MATCHING_REGEXP returns the index number of the first of parallel + regexps that a given state could accept. Parallel regexps are numbered + starting at 1. */ +#define FIRST_MATCHING_REGEXP(state, reg) (-(reg).states[state].first_end) + +/* Entry points. */ + +#ifdef __STDC__ + +/* Regsyntax() takes two arguments; the first sets the syntax bits described + earlier in this file, and the second sets the case-folding flag. */ +extern void regsyntax(long, int); + +/* Compile the given string of the given length into the given struct regexp. + Final argument is a flag specifying whether to build a searching or an + exact matcher. */ +extern void regcompile(const char *, size_t, struct regexp *, int); + +/* Execute the given struct regexp on the buffer of characters. The + first char * points to the beginning, and the second points to the + first character after the end of the buffer, which must be a writable + place so a sentinel end-of-buffer marker can be stored there. The + second-to-last argument is a flag telling whether to allow newlines to + be part of a string matching the regexp. The next-to-last argument, + if non-NULL, points to a place to increment every time we see a + newline. The final argument, if non-NULL, points to a flag that will + be set if further examination by a backtracking matcher is needed in + order to verify backreferencing; otherwise the flag will be cleared. + Returns NULL if no match is found, or a pointer to the first + character after the first & shortest matching string in the buffer. */ +extern char *regexecute(struct regexp *, char *, char *, int, int *, int *); + +/* Free the storage held by the components of a struct regexp. */ +extern void reg_free(struct regexp *); + +/* Entry points for people who know what they're doing. 
*/ + +/* Initialize the components of a struct regexp. */ +extern void reginit(struct regexp *); + +/* Incrementally parse a string of given length into a struct regexp. */ +extern void regparse(const char *, size_t, struct regexp *); + +/* Analyze a parsed regexp; second argument tells whether to build a searching + or an exact matcher. */ +extern void reganalyze(struct regexp *, int); + +/* Compute, for each possible character, the transitions out of a given + state, storing them in an array of integers. */ +extern void regstate(int, struct regexp *, int []); + +/* Error handling. */ + +/* Reg_error() is called by the regexp routines whenever an error occurs. It + takes a single argument, a NUL-terminated string describing the error. + The default reg_error() prints the error message to stderr and exits. + The user can provide a different reg_error() if so desired. */ +extern void reg_error(const char *); + +#else /* ! __STDC__ */ +extern void regsyntax(), regcompile(), reg_free(), reginit(), regparse(); +extern void reganalyze(), regstate(), reg_error(); +extern char *regexecute(); +#endif diff --git a/gnu/usr.bin/awk/eval.c b/gnu/usr.bin/awk/eval.c new file mode 100644 index 000000000000..f640f3733ada --- /dev/null +++ b/gnu/usr.bin/awk/eval.c @@ -0,0 +1,1225 @@ +/* + * eval.c - gawk parse tree interpreter + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Programming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. 
See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +extern double pow P((double x, double y)); +extern double modf P((double x, double *yp)); +extern double fmod P((double x, double y)); + +static int eval_condition P((NODE *tree)); +static NODE *op_assign P((NODE *tree)); +static NODE *func_call P((NODE *name, NODE *arg_list)); +static NODE *match_op P((NODE *tree)); + +NODE *_t; /* used as a temporary in macros */ +#ifdef MSDOS +double _msc51bug; /* to get around a bug in MSC 5.1 */ +#endif +NODE *ret_node; +int OFSlen; +int ORSlen; +int OFMTidx; +int CONVFMTidx; + +/* Macros and variables to save and restore function and loop bindings */ +/* + * the val variable allows return/continue/break-out-of-context to be + * caught and diagnosed + */ +#define PUSH_BINDING(stack, x, val) (memcpy ((char *)(stack), (char *)(x), sizeof (jmp_buf)), val++) +#define RESTORE_BINDING(stack, x, val) (memcpy ((char *)(x), (char *)(stack), sizeof (jmp_buf)), val--) + +static jmp_buf loop_tag; /* always the current binding */ +static int loop_tag_valid = 0; /* nonzero when loop_tag valid */ +static int func_tag_valid = 0; +static jmp_buf func_tag; +extern int exiting, exit_val; + +/* + * This table is used by the regexp routines to do case independent + * matching. Basically, every ascii character maps to itself, except + * uppercase letters map to lower case ones. This table has 256 + * entries, which may be overkill. Note also that if the system this + * is compiled on doesn't use 7-bit ascii, casetable[] should not be + * defined to the linker, so gawk should not load. + * + * Do NOT make this array static; it is used in several spots, not + * just in this file. 
+ */ +#if 'a' == 97 /* it's ascii */ +char casetable[] = { + '\000', '\001', '\002', '\003', '\004', '\005', '\006', '\007', + '\010', '\011', '\012', '\013', '\014', '\015', '\016', '\017', + '\020', '\021', '\022', '\023', '\024', '\025', '\026', '\027', + '\030', '\031', '\032', '\033', '\034', '\035', '\036', '\037', + /* ' ' '!' '"' '#' '$' '%' '&' ''' */ + '\040', '\041', '\042', '\043', '\044', '\045', '\046', '\047', + /* '(' ')' '*' '+' ',' '-' '.' '/' */ + '\050', '\051', '\052', '\053', '\054', '\055', '\056', '\057', + /* '0' '1' '2' '3' '4' '5' '6' '7' */ + '\060', '\061', '\062', '\063', '\064', '\065', '\066', '\067', + /* '8' '9' ':' ';' '<' '=' '>' '?' */ + '\070', '\071', '\072', '\073', '\074', '\075', '\076', '\077', + /* '@' 'A' 'B' 'C' 'D' 'E' 'F' 'G' */ + '\100', '\141', '\142', '\143', '\144', '\145', '\146', '\147', + /* 'H' 'I' 'J' 'K' 'L' 'M' 'N' 'O' */ + '\150', '\151', '\152', '\153', '\154', '\155', '\156', '\157', + /* 'P' 'Q' 'R' 'S' 'T' 'U' 'V' 'W' */ + '\160', '\161', '\162', '\163', '\164', '\165', '\166', '\167', + /* 'X' 'Y' 'Z' '[' '\' ']' '^' '_' */ + '\170', '\171', '\172', '\133', '\134', '\135', '\136', '\137', + /* '`' 'a' 'b' 'c' 'd' 'e' 'f' 'g' */ + '\140', '\141', '\142', '\143', '\144', '\145', '\146', '\147', + /* 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' */ + '\150', '\151', '\152', '\153', '\154', '\155', '\156', '\157', + /* 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' */ + '\160', '\161', '\162', '\163', '\164', '\165', '\166', '\167', + /* 'x' 'y' 'z' '{' '|' '}' '~' */ + '\170', '\171', '\172', '\173', '\174', '\175', '\176', '\177', + '\200', '\201', '\202', '\203', '\204', '\205', '\206', '\207', + '\210', '\211', '\212', '\213', '\214', '\215', '\216', '\217', + '\220', '\221', '\222', '\223', '\224', '\225', '\226', '\227', + '\230', '\231', '\232', '\233', '\234', '\235', '\236', '\237', + '\240', '\241', '\242', '\243', '\244', '\245', '\246', '\247', + '\250', '\251', '\252', '\253', '\254', '\255', '\256', '\257', + '\260', 
'\261', '\262', '\263', '\264', '\265', '\266', '\267', + '\270', '\271', '\272', '\273', '\274', '\275', '\276', '\277', + '\300', '\301', '\302', '\303', '\304', '\305', '\306', '\307', + '\310', '\311', '\312', '\313', '\314', '\315', '\316', '\317', + '\320', '\321', '\322', '\323', '\324', '\325', '\326', '\327', + '\330', '\331', '\332', '\333', '\334', '\335', '\336', '\337', + '\340', '\341', '\342', '\343', '\344', '\345', '\346', '\347', + '\350', '\351', '\352', '\353', '\354', '\355', '\356', '\357', + '\360', '\361', '\362', '\363', '\364', '\365', '\366', '\367', + '\370', '\371', '\372', '\373', '\374', '\375', '\376', '\377', +}; +#else +#include "You lose. You will need a translation table for your character set." +#endif + +/* + * Tree is a bunch of rules to run. Returns zero if it hit an exit() + * statement + */ +int +interpret(tree) +register NODE *volatile tree; +{ + jmp_buf volatile loop_tag_stack; /* shallow binding stack for loop_tag */ + static jmp_buf rule_tag; /* tag the rule currently being run, for NEXT + * and EXIT statements. 
It is static because + * there are no nested rules */ + register NODE *volatile t = NULL; /* temporary */ + NODE **volatile lhs; /* lhs == Left Hand Side for assigns, etc */ + NODE *volatile stable_tree; + int volatile traverse = 1; /* True => loop thru tree (Node_rule_list) */ + + if (tree == NULL) + return 1; + sourceline = tree->source_line; + source = tree->source_file; + switch (tree->type) { + case Node_rule_node: + traverse = 0; /* False => one for-loop iteration only */ + /* FALL THROUGH */ + case Node_rule_list: + for (t = tree; t != NULL; t = t->rnode) { + if (traverse) + tree = t->lnode; + sourceline = tree->source_line; + source = tree->source_file; + switch (setjmp(rule_tag)) { + case 0: /* normal non-jump */ + /* test pattern, if any */ + if (tree->lnode == NULL || + eval_condition(tree->lnode)) + (void) interpret(tree->rnode); + break; + case TAG_CONTINUE: /* NEXT statement */ + return 1; + case TAG_BREAK: + return 0; + default: + cant_happen(); + } + if (!traverse) /* case Node_rule_node */ + break; /* don't loop */ + } + break; + + case Node_statement_list: + for (t = tree; t != NULL; t = t->rnode) + (void) interpret(t->lnode); + break; + + case Node_K_if: + if (eval_condition(tree->lnode)) { + (void) interpret(tree->rnode->lnode); + } else { + (void) interpret(tree->rnode->rnode); + } + break; + + case Node_K_while: + PUSH_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + + stable_tree = tree; + while (eval_condition(stable_tree->lnode)) { + switch (setjmp(loop_tag)) { + case 0: /* normal non-jump */ + (void) interpret(stable_tree->rnode); + break; + case TAG_CONTINUE: /* continue statement */ + break; + case TAG_BREAK: /* break statement */ + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + return 1; + default: + cant_happen(); + } + } + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + break; + + case Node_K_do: + PUSH_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + stable_tree = tree; + do { + switch 
(setjmp(loop_tag)) { + case 0: /* normal non-jump */ + (void) interpret(stable_tree->rnode); + break; + case TAG_CONTINUE: /* continue statement */ + break; + case TAG_BREAK: /* break statement */ + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + return 1; + default: + cant_happen(); + } + } while (eval_condition(stable_tree->lnode)); + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + break; + + case Node_K_for: + PUSH_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + (void) interpret(tree->forloop->init); + stable_tree = tree; + while (eval_condition(stable_tree->forloop->cond)) { + switch (setjmp(loop_tag)) { + case 0: /* normal non-jump */ + (void) interpret(stable_tree->lnode); + /* fall through */ + case TAG_CONTINUE: /* continue statement */ + (void) interpret(stable_tree->forloop->incr); + break; + case TAG_BREAK: /* break statement */ + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + return 1; + default: + cant_happen(); + } + } + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + break; + + case Node_K_arrayfor: + { + volatile struct search l; /* For array_for */ + Func_ptr after_assign = NULL; + +#define hakvar forloop->init +#define arrvar forloop->incr + PUSH_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + lhs = get_lhs(tree->hakvar, &after_assign); + t = tree->arrvar; + if (t->type == Node_param_list) + t = stack_ptr[t->param_cnt]; + stable_tree = tree; + for (assoc_scan(t, (struct search *)&l); + l.retval; + assoc_next((struct search *)&l)) { + unref(*((NODE **) lhs)); + *lhs = dupnode(l.retval); + if (after_assign) + (*after_assign)(); + switch (setjmp(loop_tag)) { + case 0: + (void) interpret(stable_tree->lnode); + case TAG_CONTINUE: + break; + + case TAG_BREAK: + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + return 1; + default: + cant_happen(); + } + } + RESTORE_BINDING(loop_tag_stack, loop_tag, loop_tag_valid); + break; + } + + case Node_K_break: + if (loop_tag_valid == 0) + 
fatal("unexpected break"); + longjmp(loop_tag, TAG_BREAK); + break; + + case Node_K_continue: + if (loop_tag_valid == 0) { + /* + * AT&T nawk treats continue outside of loops like + * next. Allow it if not posix, and complain if + * lint. + */ + static int warned = 0; + + if (do_lint && ! warned) { + warning("use of `continue' outside of loop is not portable"); + warned = 1; + } + if (do_posix) + fatal("use of `continue' outside of loop is not allowed"); + longjmp(rule_tag, TAG_CONTINUE); + } else + longjmp(loop_tag, TAG_CONTINUE); + break; + + case Node_K_print: + do_print(tree); + break; + + case Node_K_printf: + do_printf(tree); + break; + + case Node_K_delete: + do_delete(tree->lnode, tree->rnode); + break; + + case Node_K_next: + longjmp(rule_tag, TAG_CONTINUE); + break; + + case Node_K_nextfile: + do_nextfile(); + break; + + case Node_K_exit: + /* + * In A,K,&W, p. 49, it says that an exit statement "... + * causes the program to behave as if the end of input had + * occurred; no more input is read, and the END actions, if + * any are executed." This implies that the rest of the rules + * are not done. So we immediately break out of the main loop. + */ + exiting = 1; + if (tree) { + t = tree_eval(tree->lnode); + exit_val = (int) force_number(t); + } + free_temp(t); + longjmp(rule_tag, TAG_BREAK); + break; + + case Node_K_return: + t = tree_eval(tree->lnode); + ret_node = dupnode(t); + free_temp(t); + longjmp(func_tag, TAG_RETURN); + break; + + default: + /* + * Appears to be an expression statement. Throw away the + * value. 
+ */ + if (do_lint && tree->type == Node_var) + warning("statement has no effect"); + t = tree_eval(tree); + free_temp(t); + break; + } + return 1; +} + +/* evaluate a subtree */ + +NODE * +r_tree_eval(tree) +register NODE *tree; +{ + register NODE *r, *t1, *t2; /* return value & temporary subtrees */ + register NODE **lhs; + register int di; + AWKNUM x, x1, x2; + long lx; +#ifdef CRAY + long lx2; +#endif + +#ifdef DEBUG + if (tree == NULL) + return Nnull_string; + if (tree->type == Node_val) { + if (tree->stref <= 0) cant_happen(); + return tree; + } + if (tree->type == Node_var) { + if (tree->var_value->stref <= 0) cant_happen(); + return tree->var_value; + } + if (tree->type == Node_param_list) { + if (stack_ptr[tree->param_cnt] == NULL) + return Nnull_string; + else + return stack_ptr[tree->param_cnt]->var_value; + } +#endif + switch (tree->type) { + case Node_and: + return tmp_number((AWKNUM) (eval_condition(tree->lnode) + && eval_condition(tree->rnode))); + + case Node_or: + return tmp_number((AWKNUM) (eval_condition(tree->lnode) + || eval_condition(tree->rnode))); + + case Node_not: + return tmp_number((AWKNUM) ! 
eval_condition(tree->lnode)); + + /* Builtins */ + case Node_builtin: + return ((*tree->proc) (tree->subnode)); + + case Node_K_getline: + return (do_getline(tree)); + + case Node_in_array: + return tmp_number((AWKNUM) in_array(tree->lnode, tree->rnode)); + + case Node_func_call: + return func_call(tree->rnode, tree->lnode); + + /* unary operations */ + case Node_NR: + case Node_FNR: + case Node_NF: + case Node_FIELDWIDTHS: + case Node_FS: + case Node_RS: + case Node_field_spec: + case Node_subscript: + case Node_IGNORECASE: + case Node_OFS: + case Node_ORS: + case Node_OFMT: + case Node_CONVFMT: + lhs = get_lhs(tree, (Func_ptr *)0); + return *lhs; + + case Node_var_array: + fatal("attempt to use an array in a scalar context"); + + case Node_unary_minus: + t1 = tree_eval(tree->subnode); + x = -force_number(t1); + free_temp(t1); + return tmp_number(x); + + case Node_cond_exp: + if (eval_condition(tree->lnode)) + return tree_eval(tree->rnode->lnode); + return tree_eval(tree->rnode->rnode); + + case Node_match: + case Node_nomatch: + case Node_regex: + return match_op(tree); + + case Node_func: + fatal("function `%s' called with space between name and (,\n%s", + tree->lnode->param, + "or used in other expression context"); + + /* assignments */ + case Node_assign: + { + Func_ptr after_assign = NULL; + + r = tree_eval(tree->rnode); + lhs = get_lhs(tree->lnode, &after_assign); + if (r != *lhs) { + NODE *save; + + save = *lhs; + *lhs = dupnode(r); + unref(save); + } + free_temp(r); + if (after_assign) + (*after_assign)(); + return *lhs; + } + + case Node_concat: + { +#define STACKSIZE 10 + NODE *stack[STACKSIZE]; + register NODE **sp; + register int len; + char *str; + register char *dest; + + sp = stack; + len = 0; + while (tree->type == Node_concat) { + *sp = force_string(tree_eval(tree->lnode)); + tree = tree->rnode; + len += (*sp)->stlen; + if (++sp == &stack[STACKSIZE-2]) /* one more and NULL */ + break; + } + *sp = force_string(tree_eval(tree)); + len += 
(*sp)->stlen; + *++sp = NULL; + emalloc(str, char *, len+2, "tree_eval"); + dest = str; + sp = stack; + while (*sp) { + memcpy(dest, (*sp)->stptr, (*sp)->stlen); + dest += (*sp)->stlen; + free_temp(*sp); + sp++; + } + r = make_str_node(str, len, ALREADY_MALLOCED); + r->flags |= TEMP; + } + return r; + + /* other assignment types are easier because they are numeric */ + case Node_preincrement: + case Node_predecrement: + case Node_postincrement: + case Node_postdecrement: + case Node_assign_exp: + case Node_assign_times: + case Node_assign_quotient: + case Node_assign_mod: + case Node_assign_plus: + case Node_assign_minus: + return op_assign(tree); + default: + break; /* handled below */ + } + + /* evaluate subtrees in order to do binary operation, then keep going */ + t1 = tree_eval(tree->lnode); + t2 = tree_eval(tree->rnode); + + switch (tree->type) { + case Node_geq: + case Node_leq: + case Node_greater: + case Node_less: + case Node_notequal: + case Node_equal: + di = cmp_nodes(t1, t2); + free_temp(t1); + free_temp(t2); + switch (tree->type) { + case Node_equal: + return tmp_number((AWKNUM) (di == 0)); + case Node_notequal: + return tmp_number((AWKNUM) (di != 0)); + case Node_less: + return tmp_number((AWKNUM) (di < 0)); + case Node_greater: + return tmp_number((AWKNUM) (di > 0)); + case Node_leq: + return tmp_number((AWKNUM) (di <= 0)); + case Node_geq: + return tmp_number((AWKNUM) (di >= 0)); + default: + cant_happen(); + } + break; + default: + break; /* handled below */ + } + + x1 = force_number(t1); + free_temp(t1); + x2 = force_number(t2); + free_temp(t2); + switch (tree->type) { + case Node_exp: + if ((lx = x2) == x2 && lx >= 0) { /* integer exponent */ + if (lx == 0) + x = 1; + else if (lx == 1) + x = x1; + else { + /* doing it this way should be more precise */ + for (x = x1; --lx; ) + x *= x1; + } + } else + x = pow((double) x1, (double) x2); + return tmp_number(x); + + case Node_times: + return tmp_number(x1 * x2); + + case Node_quotient: + if (x2 == 
0) + fatal("division by zero attempted"); +#ifdef _CRAY + /* + * special case for integer division, put in for Cray + */ + lx2 = x2; + if (lx2 == 0) + return tmp_number(x1 / x2); + lx = (long) x1 / lx2; + if (lx * x2 == x1) + return tmp_number((AWKNUM) lx); + else +#endif + return tmp_number(x1 / x2); + + case Node_mod: + if (x2 == 0) + fatal("division by zero attempted in mod"); +#ifndef FMOD_MISSING + return tmp_number(fmod (x1, x2)); +#else + (void) modf(x1 / x2, &x); + return tmp_number(x1 - x * x2); +#endif + + case Node_plus: + return tmp_number(x1 + x2); + + case Node_minus: + return tmp_number(x1 - x2); + + case Node_var_array: + fatal("attempt to use an array in a scalar context"); + + default: + fatal("illegal type (%d) in tree_eval", tree->type); + } + return 0; +} + +/* Is TREE true or false? Returns 0==false, non-zero==true */ +static int +eval_condition(tree) +register NODE *tree; +{ + register NODE *t1; + register int ret; + + if (tree == NULL) /* Null trees are the easiest kinds */ + return 1; + if (tree->type == Node_line_range) { + /* + * Node_line_range is kind of like Node_match, EXCEPT: the + * lnode field (more properly, the condpair field) is a node + * of a Node_cond_pair; whether we evaluate the lnode of that + * node or the rnode depends on the triggered word. More + * precisely: if we are not yet triggered, we tree_eval the + * lnode; if that returns true, we set the triggered word. + * If we are triggered (not ELSE IF, note), we tree_eval the + * rnode, clear triggered if it succeeds, and perform our + * action (regardless of success or failure). We want to be + * able to begin and end on a single input record, so this + * isn't an ELSE IF, as noted above. + */ + if (!tree->triggered) + if (!eval_condition(tree->condpair->lnode)) + return 0; + else + tree->triggered = 1; + /* Else we are triggered */ + if (eval_condition(tree->condpair->rnode)) + tree->triggered = 0; + return 1; + } + + /* + * Could just be J.random expression. 
in which case, null and 0 are + * false, anything else is true + */ + + t1 = tree_eval(tree); + if (t1->flags & MAYBE_NUM) + (void) force_number(t1); + if (t1->flags & NUMBER) + ret = t1->numbr != 0.0; + else + ret = t1->stlen != 0; + free_temp(t1); + return ret; +} + +/* + * compare two nodes, returning negative, 0, positive + */ +int +cmp_nodes(t1, t2) +register NODE *t1, *t2; +{ + register int ret; + register int len1, len2; + + if (t1 == t2) + return 0; + if (t1->flags & MAYBE_NUM) + (void) force_number(t1); + if (t2->flags & MAYBE_NUM) + (void) force_number(t2); + if ((t1->flags & NUMBER) && (t2->flags & NUMBER)) { + if (t1->numbr == t2->numbr) return 0; + else if (t1->numbr - t2->numbr < 0) return -1; + else return 1; + } + (void) force_string(t1); + (void) force_string(t2); + len1 = t1->stlen; + len2 = t2->stlen; + if (len1 == 0 || len2 == 0) + return len1 - len2; + ret = memcmp(t1->stptr, t2->stptr, len1 <= len2 ? len1 : len2); + return ret == 0 ? len1-len2 : ret; +} + +static NODE * +op_assign(tree) +register NODE *tree; +{ + AWKNUM rval, lval; + NODE **lhs; + AWKNUM t1, t2; + long ltemp; + NODE *tmp; + Func_ptr after_assign = NULL; + + lhs = get_lhs(tree->lnode, &after_assign); + lval = force_number(*lhs); + + /* + * Can't unref *lhs until we know the type; doing so + * too early breaks x += x sorts of things. + */ + switch(tree->type) { + case Node_preincrement: + case Node_predecrement: + unref(*lhs); + *lhs = make_number(lval + + (tree->type == Node_preincrement ? 1.0 : -1.0)); + if (after_assign) + (*after_assign)(); + return *lhs; + + case Node_postincrement: + case Node_postdecrement: + unref(*lhs); + *lhs = make_number(lval + + (tree->type == Node_postincrement ? 
1.0 : -1.0)); + if (after_assign) + (*after_assign)(); + return tmp_number(lval); + default: + break; /* handled below */ + } + + tmp = tree_eval(tree->rnode); + rval = force_number(tmp); + free_temp(tmp); + unref(*lhs); + switch(tree->type) { + case Node_assign_exp: + if ((ltemp = rval) == rval) { /* integer exponent */ + if (ltemp == 0) + *lhs = make_number((AWKNUM) 1); + else if (ltemp == 1) + *lhs = make_number(lval); + else { + /* doing it this way should be more precise */ + for (t1 = t2 = lval; --ltemp; ) + t1 *= t2; + *lhs = make_number(t1); + } + } else + *lhs = make_number((AWKNUM) pow((double) lval, (double) rval)); + break; + + case Node_assign_times: + *lhs = make_number(lval * rval); + break; + + case Node_assign_quotient: + if (rval == (AWKNUM) 0) + fatal("division by zero attempted in /="); +#ifdef _CRAY + /* + * special case for integer division, put in for Cray + */ + ltemp = rval; + if (ltemp == 0) { + *lhs = make_number(lval / rval); + break; + } + ltemp = (long) lval / ltemp; + if (ltemp * lval == rval) + *lhs = make_number((AWKNUM) ltemp); + else +#endif + *lhs = make_number(lval / rval); + break; + + case Node_assign_mod: + if (rval == (AWKNUM) 0) + fatal("division by zero attempted in %="); +#ifndef FMOD_MISSING + *lhs = make_number(fmod(lval, rval)); +#else + (void) modf(lval / rval, &t1); + t2 = lval - rval * t1; + *lhs = make_number(t2); +#endif + break; + + case Node_assign_plus: + *lhs = make_number(lval + rval); + break; + + case Node_assign_minus: + *lhs = make_number(lval - rval); + break; + default: + cant_happen(); + } + if (after_assign) + (*after_assign)(); + return *lhs; +} + +NODE **stack_ptr; + +static NODE * +func_call(name, arg_list) +NODE *name; /* name is a Node_val giving function name */ +NODE *arg_list; /* Node_expression_list of calling args. 
*/ +{ + register NODE *arg, *argp, *r; + NODE *n, *f; + jmp_buf volatile func_tag_stack; + jmp_buf volatile loop_tag_stack; + int volatile save_loop_tag_valid = 0; + NODE **volatile save_stack, *save_ret_node; + NODE **volatile local_stack = NULL, **sp; + int count; + extern NODE *ret_node; + + /* + * retrieve function definition node + */ + f = lookup(name->stptr); + if (!f || f->type != Node_func) + fatal("function `%s' not defined", name->stptr); +#ifdef FUNC_TRACE + fprintf(stderr, "function %s called\n", name->stptr); +#endif + count = f->lnode->param_cnt; + if (count) + emalloc(local_stack, NODE **, count*sizeof(NODE *), "func_call"); + sp = local_stack; + + /* + * for each calling arg. add NODE * on stack + */ + for (argp = arg_list; count && argp != NULL; argp = argp->rnode) { + arg = argp->lnode; + getnode(r); + r->type = Node_var; + /* + * call by reference for arrays; see below also + */ + if (arg->type == Node_param_list) + arg = stack_ptr[arg->param_cnt]; + if (arg->type == Node_var_array) + *r = *arg; + else { + n = tree_eval(arg); + r->lnode = dupnode(n); + r->rnode = (NODE *) NULL; + free_temp(n); + } + *sp++ = r; + count--; + } + if (argp != NULL) /* left over calling args. */ + warning( + "function `%s' called with more arguments than declared", + name->stptr); + /* + * add remaining params. on stack with null value + */ + while (count-- > 0) { + getnode(r); + r->type = Node_var; + r->lnode = Nnull_string; + r->rnode = (NODE *) NULL; + *sp++ = r; + } + + /* + * Execute function body, saving context, as a return statement + * will longjmp back here. + * + * Have to save and restore the loop_tag stuff so that a return + * inside a loop in a function body doesn't scrog any loops going + * on in the main program. We save the necessary info in variables + * local to this function so that function nesting works OK. + * We also only bother to save the loop stuff if we're in a loop + * when the function is called. 
+ */ + if (loop_tag_valid) { + int junk = 0; + + save_loop_tag_valid = (volatile int) loop_tag_valid; + PUSH_BINDING(loop_tag_stack, loop_tag, junk); + loop_tag_valid = 0; + } + save_stack = stack_ptr; + stack_ptr = local_stack; + PUSH_BINDING(func_tag_stack, func_tag, func_tag_valid); + save_ret_node = ret_node; + ret_node = Nnull_string; /* default return value */ + if (setjmp(func_tag) == 0) + (void) interpret(f->rnode); + + r = ret_node; + ret_node = (NODE *) save_ret_node; + RESTORE_BINDING(func_tag_stack, func_tag, func_tag_valid); + stack_ptr = (NODE **) save_stack; + + /* + * here, we pop each parameter and check whether + * it was an array. If so, and if the arg. passed in was + * a simple variable, then the value should be copied back. + * This achieves "call-by-reference" for arrays. + */ + sp = local_stack; + count = f->lnode->param_cnt; + for (argp = arg_list; count > 0 && argp != NULL; argp = argp->rnode) { + arg = argp->lnode; + if (arg->type == Node_param_list) + arg = stack_ptr[arg->param_cnt]; + n = *sp++; + if (arg->type == Node_var && n->type == Node_var_array) { + /* should we free arg->var_value ? */ + arg->var_array = n->var_array; + arg->type = Node_var_array; + } + unref(n->lnode); + freenode(n); + count--; + } + while (count-- > 0) { + n = *sp++; + /* if n is an (local) array, all the elements should be freed */ + if (n->type == Node_var_array) { + assoc_clear(n); + free(n->var_array); + } + unref(n->lnode); + freenode(n); + } + if (local_stack) + free((char *) local_stack); + + /* Restore the loop_tag stuff if necessary. */ + if (save_loop_tag_valid) { + int junk = 0; + + loop_tag_valid = (int) save_loop_tag_valid; + RESTORE_BINDING(loop_tag_stack, loop_tag, junk); + } + + if (!(r->flags & PERM)) + r->flags |= TEMP; + return r; +} + +/* + * This returns a POINTER to a node pointer. 
get_lhs(ptr) is the current + * value of the var, or where to store the var's new value + */ + +NODE ** +get_lhs(ptr, assign) +register NODE *ptr; +Func_ptr *assign; +{ + register NODE **aptr = NULL; + register NODE *n; + + switch (ptr->type) { + case Node_var_array: + fatal("attempt to use an array in a scalar context"); + case Node_var: + aptr = &(ptr->var_value); +#ifdef DEBUG + if (ptr->var_value->stref <= 0) + cant_happen(); +#endif + break; + + case Node_FIELDWIDTHS: + aptr = &(FIELDWIDTHS_node->var_value); + if (assign) + *assign = set_FIELDWIDTHS; + break; + + case Node_RS: + aptr = &(RS_node->var_value); + if (assign) + *assign = set_RS; + break; + + case Node_FS: + aptr = &(FS_node->var_value); + if (assign) + *assign = set_FS; + break; + + case Node_FNR: + unref(FNR_node->var_value); + FNR_node->var_value = make_number((AWKNUM) FNR); + aptr = &(FNR_node->var_value); + if (assign) + *assign = set_FNR; + break; + + case Node_NR: + unref(NR_node->var_value); + NR_node->var_value = make_number((AWKNUM) NR); + aptr = &(NR_node->var_value); + if (assign) + *assign = set_NR; + break; + + case Node_NF: + if (NF == -1) + (void) get_field(HUGE-1, assign); /* parse record */ + unref(NF_node->var_value); + NF_node->var_value = make_number((AWKNUM) NF); + aptr = &(NF_node->var_value); + if (assign) + *assign = set_NF; + break; + + case Node_IGNORECASE: + unref(IGNORECASE_node->var_value); + IGNORECASE_node->var_value = make_number((AWKNUM) IGNORECASE); + aptr = &(IGNORECASE_node->var_value); + if (assign) + *assign = set_IGNORECASE; + break; + + case Node_OFMT: + aptr = &(OFMT_node->var_value); + if (assign) + *assign = set_OFMT; + break; + + case Node_CONVFMT: + aptr = &(CONVFMT_node->var_value); + if (assign) + *assign = set_CONVFMT; + break; + + case Node_ORS: + aptr = &(ORS_node->var_value); + if (assign) + *assign = set_ORS; + break; + + case Node_OFS: + aptr = &(OFS_node->var_value); + if (assign) + *assign = set_OFS; + break; + + case Node_param_list: + aptr = 
&(stack_ptr[ptr->param_cnt]->var_value); + break; + + case Node_field_spec: + { + int field_num; + + n = tree_eval(ptr->lnode); + field_num = (int) force_number(n); + free_temp(n); + if (field_num < 0) + fatal("attempt to access field %d", field_num); + if (field_num == 0 && field0_valid) { /* short circuit */ + aptr = &fields_arr[0]; + if (assign) + *assign = reset_record; + break; + } + aptr = get_field(field_num, assign); + break; + } + case Node_subscript: + n = ptr->lnode; + if (n->type == Node_param_list) + n = stack_ptr[n->param_cnt]; + aptr = assoc_lookup(n, concat_exp(ptr->rnode)); + break; + + case Node_func: + fatal ("`%s' is a function, assignment is not allowed", + ptr->lnode->param); + default: + cant_happen(); + } + return aptr; +} + +static NODE * +match_op(tree) +register NODE *tree; +{ + register NODE *t1; + register Regexp *rp; + int i; + int match = 1; + + if (tree->type == Node_nomatch) + match = 0; + if (tree->type == Node_regex) + t1 = *get_field(0, (Func_ptr *) 0); + else { + t1 = force_string(tree_eval(tree->lnode)); + tree = tree->rnode; + } + rp = re_update(tree); + i = research(rp, t1->stptr, 0, t1->stlen, 0); + i = (i == -1) ^ (match == 1); + free_temp(t1); + return tmp_number((AWKNUM) i); +} + +void +set_IGNORECASE() +{ + static int warned = 0; + + if ((do_lint || do_unix) && ! 
warned) { + warned = 1; + warning("IGNORECASE not supported in compatibility mode"); + } + IGNORECASE = (force_number(IGNORECASE_node->var_value) != 0.0); + set_FS(); +} + +void +set_OFS() +{ + OFS = force_string(OFS_node->var_value)->stptr; + OFSlen = OFS_node->var_value->stlen; + OFS[OFSlen] = '\0'; +} + +void +set_ORS() +{ + ORS = force_string(ORS_node->var_value)->stptr; + ORSlen = ORS_node->var_value->stlen; + ORS[ORSlen] = '\0'; +} + +static NODE **fmt_list = NULL; +static int fmt_ok P((NODE *n)); +static int fmt_index P((NODE *n)); + +static int +fmt_ok(n) +NODE *n; +{ + /* to be done later */ + return 1; +} + +static int +fmt_index(n) +NODE *n; +{ + register int ix = 0; + static int fmt_num = 4; + static int fmt_hiwater = 0; + + if (fmt_list == NULL) + emalloc(fmt_list, NODE **, fmt_num*sizeof(*fmt_list), "fmt_index"); + (void) force_string(n); + while (ix < fmt_hiwater) { + if (cmp_nodes(fmt_list[ix], n) == 0) + return ix; + ix++; + } + /* not found */ + n->stptr[n->stlen] = '\0'; + if (!fmt_ok(n)) + warning("bad FMT specification"); + if (fmt_hiwater >= fmt_num) { + fmt_num *= 2; + erealloc(fmt_list, NODE **, fmt_num * sizeof(*fmt_list), "fmt_index"); + } + fmt_list[fmt_hiwater] = dupnode(n); + return fmt_hiwater++; +} + +void +set_OFMT() +{ + OFMTidx = fmt_index(OFMT_node->var_value); + OFMT = fmt_list[OFMTidx]->stptr; +} + +void +set_CONVFMT() +{ + CONVFMTidx = fmt_index(CONVFMT_node->var_value); + CONVFMT = fmt_list[CONVFMTidx]->stptr; +} diff --git a/gnu/usr.bin/awk/field.c b/gnu/usr.bin/awk/field.c new file mode 100644 index 000000000000..d8f9a5455631 --- /dev/null +++ b/gnu/usr.bin/awk/field.c @@ -0,0 +1,645 @@ +/* + * field.c - routines for dealing with fields and record parsing + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Programming Language. 
+ * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +static int (*parse_field) P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); +static void rebuild_record P((void)); +static int re_parse_field P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); +static int def_parse_field P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); +static int sc_parse_field P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); +static int fw_parse_field P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); +static void set_element P((int, char *, int, NODE *)); +static void grow_fields_arr P((int num)); +static void set_field P((int num, char *str, int len, NODE *dummy)); + + +static Regexp *FS_regexp = NULL; +static char *parse_extent; /* marks where to restart parse of record */ +static int parse_high_water=0; /* field number that we have parsed so far */ +static int nf_high_water = 0; /* size of fields_arr */ +static int resave_fs; +static NODE *save_FS; /* save current value of FS when line is read, + * to be used in deferred parsing + */ + +NODE **fields_arr; /* array of pointers to the field nodes */ +int field0_valid; /* $(>0) has not been changed yet */ +int default_FS; +static NODE **nodes; /* permanent repository of field nodes */ +static int 
*FIELDWIDTHS = NULL; + +void +init_fields() +{ + NODE *n; + + emalloc(fields_arr, NODE **, sizeof(NODE *), "init_fields"); + emalloc(nodes, NODE **, sizeof(NODE *), "init_fields"); + getnode(n); + *n = *Nnull_string; + fields_arr[0] = nodes[0] = n; + parse_extent = fields_arr[0]->stptr; + save_FS = dupnode(FS_node->var_value); + field0_valid = 1; +} + + +static void +grow_fields_arr(num) +int num; +{ + register int t; + register NODE *n; + + erealloc(fields_arr, NODE **, (num + 1) * sizeof(NODE *), "set_field"); + erealloc(nodes, NODE **, (num+1) * sizeof(NODE *), "set_field"); + for (t = nf_high_water+1; t <= num; t++) { + getnode(n); + *n = *Nnull_string; + fields_arr[t] = nodes[t] = n; + } + nf_high_water = num; +} + +/*ARGSUSED*/ +static void +set_field(num, str, len, dummy) +int num; +char *str; +int len; +NODE *dummy; /* not used -- just to make interface same as set_element */ +{ + register NODE *n; + + if (num > nf_high_water) + grow_fields_arr(num); + n = nodes[num]; + n->stptr = str; + n->stlen = len; + n->flags = (PERM|STR|STRING|MAYBE_NUM); + fields_arr[num] = n; +} + +/* Someone assigned a value to $(something). 
Fix up $0 to be right */ +static void +rebuild_record() +{ + register int tlen; + register NODE *tmp; + NODE *ofs; + char *ops; + register char *cops; + register NODE **ptr; + register int ofslen; + + tlen = 0; + ofs = force_string(OFS_node->var_value); + ofslen = ofs->stlen; + ptr = &fields_arr[NF]; + while (ptr > &fields_arr[0]) { + tmp = force_string(*ptr); + tlen += tmp->stlen; + ptr--; + } + tlen += (NF - 1) * ofslen; + if (tlen < 0) + tlen = 0; + emalloc(ops, char *, tlen + 2, "fix_fields"); + cops = ops; + ops[0] = '\0'; + for (ptr = &fields_arr[1]; ptr <= &fields_arr[NF]; ptr++) { + tmp = *ptr; + if (tmp->stlen == 1) + *cops++ = tmp->stptr[0]; + else if (tmp->stlen != 0) { + memcpy(cops, tmp->stptr, tmp->stlen); + cops += tmp->stlen; + } + if (ptr != &fields_arr[NF]) { + if (ofslen == 1) + *cops++ = ofs->stptr[0]; + else if (ofslen != 0) { + memcpy(cops, ofs->stptr, ofslen); + cops += ofslen; + } + } + } + tmp = make_str_node(ops, tlen, ALREADY_MALLOCED); + unref(fields_arr[0]); + fields_arr[0] = tmp; + field0_valid = 1; +} + +/* + * setup $0, but defer parsing rest of line until reference is made to $(>0) + * or to NF. At that point, parse only as much as necessary. 
+ */ +void +set_record(buf, cnt, freeold) +char *buf; +int cnt; +int freeold; +{ + register int i; + + NF = -1; + for (i = 1; i <= parse_high_water; i++) { + unref(fields_arr[i]); + } + parse_high_water = 0; + if (freeold) { + unref(fields_arr[0]); + if (resave_fs) { + resave_fs = 0; + unref(save_FS); + save_FS = dupnode(FS_node->var_value); + } + nodes[0]->stptr = buf; + nodes[0]->stlen = cnt; + nodes[0]->stref = 1; + nodes[0]->flags = (STRING|STR|PERM|MAYBE_NUM); + fields_arr[0] = nodes[0]; + } + fields_arr[0]->flags |= MAYBE_NUM; + field0_valid = 1; +} + +void +reset_record() +{ + (void) force_string(fields_arr[0]); + set_record(fields_arr[0]->stptr, fields_arr[0]->stlen, 0); +} + +void +set_NF() +{ + register int i; + + NF = (int) force_number(NF_node->var_value); + if (NF > nf_high_water) + grow_fields_arr(NF); + for (i = parse_high_water + 1; i <= NF; i++) { + unref(fields_arr[i]); + fields_arr[i] = Nnull_string; + } + field0_valid = 0; +} + +/* + * this is called both from get_field() and from do_split() + * via (*parse_field)(). 
This variation is for when FS is a regular + * expression -- either user-defined or because RS=="" and FS==" " + */ +static int +re_parse_field(up_to, buf, len, fs, rp, set, n) +int up_to; /* parse only up to this field number */ +char **buf; /* on input: string to parse; on output: point to start next */ +int len; +NODE *fs; +Regexp *rp; +void (*set) (); /* routine to set the value of the parsed field */ +NODE *n; +{ + register char *scan = *buf; + register int nf = parse_high_water; + register char *field; + register char *end = scan + len; + + if (up_to == HUGE) + nf = 0; + if (len == 0) + return nf; + + if (*RS == 0 && default_FS) + while (scan < end && isspace(*scan)) + scan++; + field = scan; + while (scan < end + && research(rp, scan, 0, (int)(end - scan), 1) != -1 + && nf < up_to) { + if (REEND(rp, scan) == RESTART(rp, scan)) { /* null match */ + scan++; + if (scan == end) { + (*set)(++nf, field, scan - field, n); + up_to = nf; + break; + } + continue; + } + (*set)(++nf, field, scan + RESTART(rp, scan) - field, n); + scan += REEND(rp, scan); + field = scan; + if (scan == end) /* FS at end of record */ + (*set)(++nf, field, 0, n); + } + if (nf != up_to && scan < end) { + (*set)(++nf, scan, (int)(end - scan), n); + scan = end; + } + *buf = scan; + return (nf); +} + +/* + * this is called both from get_field() and from do_split() + * via (*parse_field)(). This variation is for when FS is a single space + * character. 
+ */ +static int +def_parse_field(up_to, buf, len, fs, rp, set, n) +int up_to; /* parse only up to this field number */ +char **buf; /* on input: string to parse; on output: point to start next */ +int len; +NODE *fs; +Regexp *rp; +void (*set) (); /* routine to set the value of the parsed field */ +NODE *n; +{ + register char *scan = *buf; + register int nf = parse_high_water; + register char *field; + register char *end = scan + len; + char sav; + + if (up_to == HUGE) + nf = 0; + if (len == 0) + return nf; + + /* before doing anything save the char at *end */ + sav = *end; + /* because it will be destroyed now: */ + + *end = ' '; /* sentinel character */ + for (; nf < up_to; scan++) { + /* + * special case: fs is single space, strip leading whitespace + */ + while (scan < end && (*scan == ' ' || *scan == '\t')) + scan++; + if (scan >= end) + break; + field = scan; + while (*scan != ' ' && *scan != '\t') + scan++; + (*set)(++nf, field, (int)(scan - field), n); + if (scan == end) + break; + } + + /* everything done, restore original char at *end */ + *end = sav; + + *buf = scan; + return nf; +} + +/* + * this is called both from get_field() and from do_split() + * via (*parse_field)(). This variation is for when FS is a single character + * other than space. 
+ */ +static int +sc_parse_field(up_to, buf, len, fs, rp, set, n) +int up_to; /* parse only up to this field number */ +char **buf; /* on input: string to parse; on output: point to start next */ +int len; +NODE *fs; +Regexp *rp; +void (*set) (); /* routine to set the value of the parsed field */ +NODE *n; +{ + register char *scan = *buf; + register char fschar; + register int nf = parse_high_water; + register char *field; + register char *end = scan + len; + char sav; + + if (up_to == HUGE) + nf = 0; + if (len == 0) + return nf; + + if (*RS == 0 && fs->stlen == 0) + fschar = '\n'; + else + fschar = fs->stptr[0]; + + /* before doing anything save the char at *end */ + sav = *end; + /* because it will be destroyed now: */ + *end = fschar; /* sentinel character */ + + for (; nf < up_to; scan++) { + field = scan; + while (*scan++ != fschar) + ; + scan--; + (*set)(++nf, field, (int)(scan - field), n); + if (scan == end) + break; + } + + /* everything done, restore original char at *end */ + *end = sav; + + *buf = scan; + return nf; +} + +/* + * this is called both from get_field() and from do_split() + * via (*parse_field)(). This variation is for when fields are fixed widths. 
+ */ +static int +fw_parse_field(up_to, buf, len, fs, rp, set, n) +int up_to; /* parse only up to this field number */ +char **buf; /* on input: string to parse; on output: point to start next */ +int len; +NODE *fs; +Regexp *rp; +void (*set) (); /* routine to set the value of the parsed field */ +NODE *n; +{ + register char *scan = *buf; + register int nf = parse_high_water; + register char *end = scan + len; + + if (up_to == HUGE) + nf = 0; + if (len == 0) + return nf; + for (; nf < up_to && (len = FIELDWIDTHS[nf+1]) != -1; ) { + if (len > end - scan) + len = end - scan; + (*set)(++nf, scan, len, n); + scan += len; + } + if (len == -1) + *buf = end; + else + *buf = scan; + return nf; +} + +NODE ** +get_field(requested, assign) +register int requested; +Func_ptr *assign; /* this field is on the LHS of an assign */ +{ + /* + * if requesting whole line but some other field has been altered, + * then the whole line must be rebuilt + */ + if (requested == 0) { + if (!field0_valid) { + /* first, parse remainder of input record */ + if (NF == -1) { + NF = (*parse_field)(HUGE-1, &parse_extent, + fields_arr[0]->stlen - + (parse_extent - fields_arr[0]->stptr), + save_FS, FS_regexp, set_field, + (NODE *)NULL); + parse_high_water = NF; + } + rebuild_record(); + } + if (assign) + *assign = reset_record; + return &fields_arr[0]; + } + + /* assert(requested > 0); */ + + if (assign) + field0_valid = 0; /* $0 needs reconstruction */ + + if (requested <= parse_high_water) /* already parsed this field */ + return &fields_arr[requested]; + + if (NF == -1) { /* have not yet parsed to end of record */ + /* + * parse up to requested fields, calling set_field() for each, + * saving in parse_extent the point where the parse left off + */ + if (parse_high_water == 0) /* starting at the beginning */ + parse_extent = fields_arr[0]->stptr; + parse_high_water = (*parse_field)(requested, &parse_extent, + fields_arr[0]->stlen - (parse_extent-fields_arr[0]->stptr), + save_FS, FS_regexp, 
set_field, (NODE *)NULL); + + /* + * if we reached the end of the record, set NF to the number of + * fields so far. Note that requested might actually refer to + * a field that is beyond the end of the record, but we won't + * set NF to that value at this point, since this is only a + * reference to the field and NF only gets set if the field + * is assigned to -- this case is handled below + */ + if (parse_extent == fields_arr[0]->stptr + fields_arr[0]->stlen) + NF = parse_high_water; + if (requested == HUGE-1) /* HUGE-1 means set NF */ + requested = parse_high_water; + } + if (parse_high_water < requested) { /* requested beyond end of record */ + if (assign) { /* expand record */ + register int i; + + if (requested > nf_high_water) + grow_fields_arr(requested); + + /* fill in fields that don't exist */ + for (i = parse_high_water + 1; i <= requested; i++) + fields_arr[i] = Nnull_string; + + NF = requested; + parse_high_water = requested; + } else + return &Nnull_string; + } + + return &fields_arr[requested]; +} + +static void +set_element(num, s, len, n) +int num; +char *s; +int len; +NODE *n; +{ + register NODE *it; + + it = make_string(s, len); + it->flags |= MAYBE_NUM; + *assoc_lookup(n, tmp_number((AWKNUM) (num))) = it; +} + +NODE * +do_split(tree) +NODE *tree; +{ + NODE *t1, *t2, *t3, *tmp; + NODE *fs; + char *s; + int (*parseit)P((int, char **, int, NODE *, + Regexp *, void (*)(), NODE *)); + Regexp *rp = NULL; + + t1 = tree_eval(tree->lnode); + t2 = tree->rnode->lnode; + t3 = tree->rnode->rnode->lnode; + + (void) force_string(t1); + + if (t2->type == Node_param_list) + t2 = stack_ptr[t2->param_cnt]; + if (t2->type != Node_var && t2->type != Node_var_array) + fatal("second argument of split is not a variable"); + assoc_clear(t2); + + if (t3->re_flags & FS_DFLT) { + parseit = parse_field; + fs = force_string(FS_node->var_value); + rp = FS_regexp; + } else { + tmp = force_string(tree_eval(t3->re_exp)); + if (tmp->stlen == 1) { + if (tmp->stptr[0] == ' ') + 
parseit = def_parse_field; + else + parseit = sc_parse_field; + } else { + parseit = re_parse_field; + rp = re_update(t3); + } + fs = tmp; + } + + s = t1->stptr; + tmp = tmp_number((AWKNUM) (*parseit)(HUGE, &s, (int)t1->stlen, + fs, rp, set_element, t2)); + free_temp(t1); + free_temp(t3); + return tmp; +} + +void +set_FS() +{ + NODE *tmp = NULL; + char buf[10]; + NODE *fs; + + buf[0] = '\0'; + default_FS = 0; + if (FS_regexp) { + refree(FS_regexp); + FS_regexp = NULL; + } + fs = force_string(FS_node->var_value); + if (fs->stlen > 1) + parse_field = re_parse_field; + else if (*RS == 0) { + parse_field = sc_parse_field; + if (fs->stlen == 1) { + if (fs->stptr[0] == ' ') { + default_FS = 1; + strcpy(buf, "[ \t\n]+"); + } else if (fs->stptr[0] != '\n') + sprintf(buf, "[%c\n]", fs->stptr[0]); + } + } else { + parse_field = def_parse_field; + if (fs->stptr[0] == ' ' && fs->stlen == 1) + default_FS = 1; + else if (fs->stptr[0] != ' ' && fs->stlen == 1) { + if (IGNORECASE == 0) + parse_field = sc_parse_field; + else + sprintf(buf, "[%c]", fs->stptr[0]); + } + } + if (buf[0]) { + FS_regexp = make_regexp(buf, strlen(buf), IGNORECASE, 1); + parse_field = re_parse_field; + } else if (parse_field == re_parse_field) { + FS_regexp = make_regexp(fs->stptr, fs->stlen, IGNORECASE, 1); + } else + FS_regexp = NULL; + resave_fs = 1; +} + +void +set_RS() +{ + (void) force_string(RS_node->var_value); + RS = RS_node->var_value->stptr; + set_FS(); +} + +void +set_FIELDWIDTHS() +{ + register char *scan; + char *end; + register int i; + static int fw_alloc = 1; + static int warned = 0; + extern double strtod(); + + if (do_lint && ! 
warned) { + warned = 1; + warning("use of FIELDWIDTHS is a gawk extension"); + } + if (do_unix) /* quick and dirty, does the trick */ + return; + + parse_field = fw_parse_field; + scan = force_string(FIELDWIDTHS_node->var_value)->stptr; + end = scan + 1; + if (FIELDWIDTHS == NULL) + emalloc(FIELDWIDTHS, int *, fw_alloc * sizeof(int), "set_FIELDWIDTHS"); + FIELDWIDTHS[0] = 0; + for (i = 1; ; i++) { + if (i >= fw_alloc) { + fw_alloc *= 2; + erealloc(FIELDWIDTHS, int *, fw_alloc * sizeof(int), "set_FIELDWIDTHS"); + } + FIELDWIDTHS[i] = (int) strtod(scan, &end); + if (end == scan) + break; + scan = end; + } + FIELDWIDTHS[i] = -1; +} diff --git a/gnu/usr.bin/awk/gawk.texi b/gnu/usr.bin/awk/gawk.texi new file mode 100644 index 000000000000..b2802623136d --- /dev/null +++ b/gnu/usr.bin/awk/gawk.texi @@ -0,0 +1,11270 @@ +\input texinfo @c -*-texinfo-*- +@c %**start of header (This is for running Texinfo on a region.) +@setfilename gawk.info +@settitle The GAWK Manual +@c @smallbook +@c %**end of header (This is for running Texinfo on a region.) + +@ifinfo +@synindex fn cp +@synindex vr cp +@end ifinfo +@iftex +@syncodeindex fn cp +@syncodeindex vr cp +@end iftex + +@c If "finalout" is commented out, the printed output will show +@c black boxes that mark lines that are too long. Thus, it is +@c unwise to comment it out when running a master in case there are +@c overfulls which are deemed okay. + +@iftex +@finalout +@end iftex + +@c ===> NOTE! <== +@c Determine the edition number in *four* places by hand: +@c 1. First ifinfo section 2. title page 3. copyright page 4. top node +@c To find the locations, search for !!set + +@ifinfo +This file documents @code{awk}, a program that you can use to select +particular records in a file and perform operations upon them. + +This is Edition 0.15 of @cite{The GAWK Manual}, @* +for the 2.15 version of the GNU implementation @* +of AWK. + +Copyright (C) 1989, 1991, 1992, 1993 Free Software Foundation, Inc. 
+ +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. + +@ignore +Permission is granted to process this file through TeX and print the +results, provided the printed document carries copying permission +notice identical to this one except for the removal of this paragraph +(this paragraph not being relevant to the printed manual). + +@end ignore +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided that the entire +resulting derived work is distributed under the terms of a permission +notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions, +except that this permission notice may be stated in a translation approved +by the Foundation. +@end ifinfo + +@setchapternewpage odd + +@c !!set edition, date, version +@titlepage +@title The GAWK Manual +@subtitle Edition 0.15 +@subtitle April 1993 +@author Diane Barlow Close +@author Arnold D. Robbins +@author Paul H. Rubin +@author Richard Stallman + +@c Include the Distribution inside the titlepage environment so +@c that headings are turned off. Headings on and off do not work. + +@page +@vskip 0pt plus 1filll +Copyright @copyright{} 1989, 1991, 1992, 1993 Free Software Foundation, Inc. +@sp 2 + +@c !!set edition, date, version +This is Edition 0.15 of @cite{The GAWK Manual}, @* +for the 2.15 version of the GNU implementation @* +of AWK. + +@sp 2 +Published by the Free Software Foundation @* +675 Massachusetts Avenue @* +Cambridge, MA 02139 USA @* +Printed copies are available for $20 each. + +Permission is granted to make and distribute verbatim copies of +this manual provided the copyright notice and this permission notice +are preserved on all copies. 
+ +Permission is granted to copy and distribute modified versions of this +manual under the conditions for verbatim copying, provided that the entire +resulting derived work is distributed under the terms of a permission +notice identical to this one. + +Permission is granted to copy and distribute translations of this manual +into another language, under the above conditions for modified versions, +except that this permission notice may be stated in a translation approved +by the Foundation. +@end titlepage + +@ifinfo +@node Top, Preface, (dir), (dir) +@comment node-name, next, previous, up +@top General Introduction +@c Preface or Licensing nodes should come right after the Top +@c node, in `unnumbered' sections, then the chapter, `What is gawk'. + +This file documents @code{awk}, a program that you can use to select +particular records in a file and perform operations upon them. + +@c !!set edition, date, version +This is Edition 0.15 of @cite{The GAWK Manual}, @* +for the 2.15 version of the GNU implementation @* +of AWK. + +@end ifinfo + +@menu +* Preface:: What you can do with @code{awk}; brief history + and acknowledgements. +* Copying:: Your right to copy and distribute @code{gawk}. +* This Manual:: Using this manual. + Includes sample input files that you can use. +* Getting Started:: A basic introduction to using @code{awk}. + How to run an @code{awk} program. + Command line syntax. +* Reading Files:: How to read files and manipulate fields. +* Printing:: How to print using @code{awk}. Describes the + @code{print} and @code{printf} statements. + Also describes redirection of output. +* One-liners:: Short, sample @code{awk} programs. +* Patterns:: The various types of patterns + explained in detail. +* Actions:: The various types of actions are + introduced here. Describes + expressions and the various operators in + detail. Also describes comparison expressions. +* Expressions:: Expressions are the basic building + blocks of statements. 
+* Statements:: The various control statements are + described in detail. +* Arrays:: The description and use of arrays. + Also includes array-oriented control + statements. +* Built-in:: The built-in functions are summarized here. +* User-defined:: User-defined functions are described in detail. +* Built-in Variables:: Built-in Variables +* Command Line:: How to run @code{gawk}. +* Language History:: The evolution of the @code{awk} language. +* Installation:: Installing @code{gawk} under + various operating systems. +* Gawk Summary:: @code{gawk} Options and Language Summary. +* Sample Program:: A sample @code{awk} program with a + complete explanation. +* Bugs:: Reporting Problems and Bugs. +* Notes:: Something about the + implementation of @code{gawk}. +* Glossary:: An explanation of some unfamiliar terms. +* Index:: +@end menu + +@node Preface, Copying, Top, Top +@comment node-name, next, previous, up +@unnumbered Preface + +@iftex +@cindex what is @code{awk} +@end iftex +If you are like many computer users, you would frequently like to make +changes in various text files wherever certain patterns appear, or +extract data from parts of certain lines while discarding the rest. To +write a program to do this in a language such as C or Pascal is a +time-consuming inconvenience that may take many lines of code. The job +may be easier with @code{awk}. + +The @code{awk} utility interprets a special-purpose programming language +that makes it possible to handle simple data-reformatting jobs easily +with just a few lines of code. + +The GNU implementation of @code{awk} is called @code{gawk}; it is fully +upward compatible with the System V Release 4 version of +@code{awk}. @code{gawk} is also upward compatible with the @sc{posix} +(draft) specification of the @code{awk} language. This means that all +properly written @code{awk} programs should work with @code{gawk}. 
+Thus, we usually don't distinguish between @code{gawk} and other @code{awk} +implementations in this manual.@refill + +@cindex uses of @code{awk} +This manual teaches you what @code{awk} does and how you can use +@code{awk} effectively. You should already be familiar with basic +system commands such as @code{ls}. Using @code{awk} you can: @refill + +@itemize @bullet +@item +manage small, personal databases + +@item +generate reports + +@item +validate data +@item +produce indexes, and perform other document preparation tasks + +@item +even experiment with algorithms that can be adapted later to other computer +languages +@end itemize + +@iftex +This manual has the difficult task of being both tutorial and reference. +If you are a novice, feel free to skip over details that seem too complex. +You should also ignore the many cross references; they are for the +expert user, and for the on-line Info version of the manual. +@end iftex + +@menu +* History:: The history of @code{gawk} and + @code{awk}. Acknowledgements. +@end menu + +@node History, , Preface, Preface +@comment node-name, next, previous, up +@unnumberedsec History of @code{awk} and @code{gawk} + +@cindex acronym +@cindex history of @code{awk} +The name @code{awk} comes from the initials of its designers: Alfred V. +Aho, Peter J. Weinberger, and Brian W. Kernighan. The original version of +@code{awk} was written in 1977. In 1985 a new version made the programming +language more powerful, introducing user-defined functions, multiple input +streams, and computed regular expressions. +This new version became generally available with System V Release 3.1. +The version in System V Release 4 added some new features and also cleaned +up the behavior in some of the ``dark corners'' of the language. 
+The specification for @code{awk} in the @sc{posix} Command Language +and Utilities standard further clarified the language based on feedback +from both the @code{gawk} designers, and the original @code{awk} +designers.@refill + +The GNU implementation, @code{gawk}, was written in 1986 by Paul Rubin +and Jay Fenlason, with advice from Richard Stallman. John Woods +contributed parts of the code as well. In 1988 and 1989, David Trueman, with +help from Arnold Robbins, thoroughly reworked @code{gawk} for compatibility +with the newer @code{awk}. Current development (1992) focuses on bug fixes, +performance improvements, and standards compliance. + +We need to thank many people for their assistance in producing this +manual. Jay Fenlason contributed many ideas and sample programs. Richard +Mlynarik and Robert J. Chassell gave helpful comments on early drafts of this +manual. The paper @cite{A Supplemental Document for @code{awk}} by John W. +Pierce of the Chemistry Department at UC San Diego, pinpointed several +issues relevant both to @code{awk} implementation and to this manual, that +would otherwise have escaped us. David Trueman, Pat Rankin, and Michal +Jaegermann also contributed sections of the manual.@refill + +The following people provided many helpful comments on this edition of +the manual: Rick Adams, Michael Brennan, Rich Burridge, Diane Close, +Christopher (``Topher'') Eliot, Michael Lijewski, Pat Rankin, Miriam Robbins, +and Michal Jaegermann. Robert J. Chassell provided much valuable advice on +the use of Texinfo. + +Finally, we would like to thank Brian Kernighan of Bell Labs for invaluable +assistance during the testing and debugging of @code{gawk}, and for +help in clarifying numerous points about the language.@refill + +@node Copying, This Manual, Preface, Top +@unnumbered GNU GENERAL PUBLIC LICENSE +@center Version 2, June 1991 + +@display +Copyright @copyright{} 1989, 1991 Free Software Foundation, Inc. 
+675 Mass Ave, Cambridge, MA 02139, USA + +Everyone is permitted to copy and distribute verbatim copies +of this license document, but changing it is not allowed. +@end display + +@c fakenode --- for prepinfo +@unnumberedsec Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software---to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Library General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. 
+ + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + +@iftex +@c fakenode --- for prepinfo +@unnumberedsec TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION +@end iftex +@ifinfo +@center TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION +@end ifinfo + +@enumerate +@item +This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The ``Program'', below, +refers to any such program or work, and a ``work based on the Program'' +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term ``modification''.) Each licensee is addressed as ``you''. + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. 
The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + +@item +You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + +@item +You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + +@enumerate a +@item +You must cause the modified files to carry prominent notices +stating that you changed the files and the date of any change. + +@item +You must cause any work that you distribute or publish, that in +whole or in part contains or is derived from the Program or any +part thereof, to be licensed as a whole at no charge to all third +parties under the terms of this License. + +@item +If the modified program normally reads commands interactively +when run, you must cause it, when started running for such +interactive use in the most ordinary way, to print or display an +announcement including an appropriate copyright notice and a +notice that there is no warranty (or else, saying that you provide +a warranty) and that users may redistribute the program under +these conditions, and telling the user how to view a copy of this +License. 
(Exception: if the Program itself is interactive but +does not normally print such an announcement, your work based on +the Program is not required to print an announcement.) +@end enumerate + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. 
+ +@item +You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + +@enumerate a +@item +Accompany it with the complete corresponding machine-readable +source code, which must be distributed under the terms of Sections +1 and 2 above on a medium customarily used for software interchange; or, + +@item +Accompany it with a written offer, valid for at least three +years, to give any third party, for a charge no more than your +cost of physically performing source distribution, a complete +machine-readable copy of the corresponding source code, to be +distributed under the terms of Sections 1 and 2 above on a medium +customarily used for software interchange; or, + +@item +Accompany it with the information you received as to the offer +to distribute corresponding source code. (This alternative is +allowed only for noncommercial distribution and only if you +received the program in object code or executable form with such +an offer, in accord with Subsection b above.) +@end enumerate + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. 
+ +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + +@item +You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + +@item +You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + +@item +Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. 
+ +@item +If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. 
+ +@item +If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + +@item +The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and ``any +later version'', you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + +@item +If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + +@iftex +@c fakenode --- for prepinfo +@heading NO WARRANTY +@end iftex +@ifinfo +@center NO WARRANTY +@end ifinfo + +@item +BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. 
EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM ``AS IS'' WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + +@item +IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. +@end enumerate + +@iftex +@c fakenode --- for prepinfo +@heading END OF TERMS AND CONDITIONS +@end iftex +@ifinfo +@center END OF TERMS AND CONDITIONS +@end ifinfo + +@page +@c fakenode --- for prepinfo +@unnumberedsec How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the ``copyright'' line and a pointer to where the full notice is found. 
+ +@smallexample +@var{one line to give the program's name and a brief idea of what it does.} +Copyright (C) 19@var{yy} @var{name of author} + +This program is free software; you can redistribute it and/or modify +it under the terms of the GNU General Public License as published by +the Free Software Foundation; either version 2 of the License, or +(at your option) any later version. + +This program is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the +GNU General Public License for more details. + +You should have received a copy of the GNU General Public License +along with this program; if not, write to the Free Software +Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. +@end smallexample + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + +@smallexample +Gnomovision version 69, Copyright (C) 19@var{yy} @var{name of author} +Gnomovision comes with ABSOLUTELY NO WARRANTY; for details +type `show w'. +This is free software, and you are welcome to redistribute it +under certain conditions; type `show c' for details. +@end smallexample + +The hypothetical commands @samp{show w} and @samp{show c} should show +the appropriate parts of the General Public License. Of course, the +commands you use may be called something other than @samp{show w} and +@samp{show c}; they could even be mouse-clicks or menu items---whatever +suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a ``copyright disclaimer'' for the program, if +necessary. Here is a sample; alter the names: + +@smallexample +Yoyodyne, Inc., hereby disclaims all copyright interest in the program +`Gnomovision' (which makes passes at compilers) written by James Hacker. 
+ +@var{signature of Ty Coon}, 1 April 1989 +Ty Coon, President of Vice +@end smallexample + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Library General +Public License instead of this License. + +@node This Manual, Getting Started, Copying, Top +@chapter Using this Manual +@cindex manual, using this +@cindex using this manual +@cindex language, @code{awk} +@cindex program, @code{awk} +@cindex @code{awk} language +@cindex @code{awk} program + +The term @code{awk} refers to a particular program, and to the language you +use to tell this program what to do. When we need to be careful, we call +the program ``the @code{awk} utility'' and the language ``the @code{awk} +language.'' The term @code{gawk} refers to a version of @code{awk} developed +as part of the GNU project. The purpose of this manual is to explain +both the +@code{awk} language and how to run the @code{awk} utility.@refill + +While concentrating on the features of @code{gawk}, the manual will also +attempt to describe important differences between @code{gawk} and other +@code{awk} implementations. In particular, any features that are not +in the @sc{posix} standard for @code{awk} will be noted. @refill + +The term @dfn{@code{awk} program} refers to a program written by you in +the @code{awk} programming language.@refill + +@xref{Getting Started, ,Getting Started with @code{awk}}, for the bare +essentials you need to know to start using @code{awk}. + +Some useful ``one-liners'' are included to give you a feel for the +@code{awk} language (@pxref{One-liners, ,Useful ``One-liners''}). + +@ignore +@strong{I deleted four paragraphs here because they would confuse the +beginner more than help him. 
They mention terms such as ``field,'' +``pattern,'' ``action,'' ``built-in function'' which the beginner +doesn't know.} + +@strong{If you can find a way to introduce several of these concepts here, +enough to give the reader a map of what is to follow, that might +be useful. I'm not sure that can be done without taking up more +space than ought to be used here. There may be no way to win.} + +@strong{ADR: I'd like to tackle this in phase 2 of my editing.} +@end ignore + +A sample @code{awk} program has been provided for you +(@pxref{Sample Program}).@refill + +If you find terms that you aren't familiar with, try looking them +up in the glossary (@pxref{Glossary}).@refill + +The entire @code{awk} language is summarized for quick reference in +@ref{Gawk Summary, ,@code{gawk} Summary}. Look there if you just need +to refresh your memory about a particular feature.@refill + +Most of the time complete @code{awk} programs are used as examples, but in +some of the more advanced sections, only the part of the @code{awk} program +that illustrates the concept being described is shown.@refill + +@menu +* Sample Data Files:: Sample data files for use in the @code{awk} + programs illustrated in this manual. +@end menu + +@node Sample Data Files, , This Manual, This Manual +@section Data Files for the Examples + +@cindex input file, sample +@cindex sample input file +@cindex @file{BBS-list} file +Many of the examples in this manual take their input from two sample +data files. The first, called @file{BBS-list}, represents a list of +computer bulletin board systems together with information about those systems. +The second data file, called @file{inventory-shipped}, contains +information about shipments on a monthly basis. Each line of these +files is one @dfn{record}. + +In the file @file{BBS-list}, each record contains the name of a computer +bulletin board, its phone number, the board's baud rate, and a code for +the number of hours it is operational. 
An @samp{A} in the last column +means the board operates 24 hours a day. A @samp{B} in the last +column means the board operates only during evening and weekend hours. A +@samp{C} means the board operates only on weekends. + +@example +aardvark 555-5553 1200/300 B +alpo-net 555-3412 2400/1200/300 A +barfly 555-7685 1200/300 A +bites 555-1675 2400/1200/300 A +camelot 555-0542 300 C +core 555-2912 1200/300 C +fooey 555-1234 2400/1200/300 B +foot 555-6699 1200/300 B +macfoo 555-6480 1200/300 A +sdace 555-3430 2400/1200/300 A +sabafoo 555-2127 1200/300 C +@end example + +@cindex @file{inventory-shipped} file +The second data file, called @file{inventory-shipped}, represents +information about shipments during the year. +Each record contains the month of the year, the number +of green crates shipped, the number of red boxes shipped, the number of +orange bags shipped, and the number of blue packages shipped, +respectively. There are 16 entries, covering the 12 months of one year +and 4 months of the next year.@refill + +@example +Jan 13 25 15 115 +Feb 15 32 24 226 +Mar 15 24 34 228 +Apr 31 52 63 420 +May 16 34 29 208 +Jun 31 42 75 492 +Jul 24 34 67 436 +Aug 15 34 47 316 +Sep 13 55 37 277 +Oct 29 54 68 525 +Nov 20 87 82 577 +Dec 17 35 61 401 + +Jan 21 36 64 620 +Feb 26 58 80 652 +Mar 24 75 70 495 +Apr 21 70 74 514 +@end example + +@ifinfo +If you are reading this in GNU Emacs using Info, you can copy the regions +of text showing these sample files into your own test files. This way you +can try out the examples shown in the remainder of this document. You do +this by using the command @kbd{M-x write-region} to copy text from the Info +file into a file for use with @code{awk} +(@pxref{Misc File Ops, , , emacs, GNU Emacs Manual}, +for more information). Using this information, create your own +@file{BBS-list} and @file{inventory-shipped} files, and practice what you +learn in this manual. 
+@end ifinfo + +@node Getting Started, Reading Files, This Manual, Top +@chapter Getting Started with @code{awk} +@cindex script, definition of +@cindex rule, definition of +@cindex program, definition of +@cindex basic function of @code{gawk} + +The basic function of @code{awk} is to search files for lines (or other +units of text) that contain certain patterns. When a line matches one +of the patterns, @code{awk} performs specified actions on that line. +@code{awk} keeps processing input lines in this way until the end of the +input file is reached.@refill + +When you run @code{awk}, you specify an @code{awk} @dfn{program} which +tells @code{awk} what to do. The program consists of a series of +@dfn{rules}. (It may also contain @dfn{function definitions}, but that +is an advanced feature, so we will ignore it for now. +@xref{User-defined, ,User-defined Functions}.) Each rule specifies one +pattern to search for, and one action to perform when that pattern is found. + +Syntactically, a rule consists of a pattern followed by an action. The +action is enclosed in curly braces to separate it from the pattern. +Rules are usually separated by newlines. Therefore, an @code{awk} +program looks like this: + +@example +@var{pattern} @{ @var{action} @} +@var{pattern} @{ @var{action} @} +@dots{} +@end example + +@menu +* Very Simple:: A very simple example. +* Two Rules:: A less simple one-line example with two rules. +* More Complex:: A more complex example. +* Running gawk:: How to run @code{gawk} programs; + includes command line syntax. +* Comments:: Adding documentation to @code{gawk} programs. +* Statements/Lines:: Subdividing or combining statements into lines. +* When:: When to use @code{gawk} and + when to use other things. 
+@end menu + +@node Very Simple, Two Rules, Getting Started, Getting Started +@section A Very Simple Example + +@cindex @samp{print $0} +The following command runs a simple @code{awk} program that searches the +input file @file{BBS-list} for the string of characters @samp{foo}. (A +string of characters is usually called a @dfn{string}. +The term @dfn{string} is perhaps based on similar usage in English, such +as ``a string of pearls,'' or ``a string of cars in a train.'') + +@example +awk '/foo/ @{ print $0 @}' BBS-list +@end example + +@noindent +When lines containing @samp{foo} are found, they are printed, because +@w{@samp{print $0}} means print the current line. (Just @samp{print} by +itself means the same thing, so we could have written that +instead.) + +You will notice that slashes, @samp{/}, surround the string @samp{foo} +in the actual @code{awk} program. The slashes indicate that @samp{foo} +is a pattern to search for. This type of pattern is called a +@dfn{regular expression}, and is covered in more detail later +(@pxref{Regexp, ,Regular Expressions as Patterns}). There are +single-quotes around the @code{awk} program so that the shell won't +interpret any of it as special shell characters.@refill + +Here is what this program prints: + +@example +@group +fooey 555-1234 2400/1200/300 B +foot 555-6699 1200/300 B +macfoo 555-6480 1200/300 A +sabafoo 555-2127 1200/300 C +@end group +@end example + +@cindex action, default +@cindex pattern, default +@cindex default action +@cindex default pattern +In an @code{awk} rule, either the pattern or the action can be omitted, +but not both. If the pattern is omitted, then the action is performed +for @emph{every} input line. If the action is omitted, the default +action is to print all lines that match the pattern. 
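The two defaults can be checked with a short shell sketch. This is only an illustration: the scratch file name and the three records (copied from @file{BBS-list}) are assumptions made for the example, not part of the manual's sample files.

```shell
#!/bin/sh
# Hypothetical scratch file holding three records copied from BBS-list.
tmp=${TMPDIR:-/tmp}/bbs-sample.$$
cat > "$tmp" <<'EOF'
aardvark 555-5553 1200/300 B
fooey 555-1234 2400/1200/300 B
macfoo 555-6480 1200/300 A
EOF

# Action omitted: the default action prints each line matching /foo/.
default_action=$(awk '/foo/' "$tmp")

# The same rule with the action written out explicitly.
explicit_action=$(awk '/foo/ { print $0 }' "$tmp")

# Pattern omitted: the action is performed for every input line.
every_line=$(awk '{ print $0 }' "$tmp")

printf '%s\n' "$default_action"
rm -f "$tmp"
```

Both forms of the @samp{/foo/} rule should select the same records, while the rule with no pattern should reproduce all three input lines.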
+ +Thus, we could leave out the action (the @code{print} statement and the curly +braces) in the above example, and the result would be the same: all +lines matching the pattern @samp{foo} would be printed. By comparison, +omitting the @code{print} statement but retaining the curly braces makes an +empty action that does nothing; then no lines would be printed. + +@node Two Rules, More Complex, Very Simple, Getting Started +@section An Example with Two Rules +@cindex how @code{awk} works + +The @code{awk} utility reads the input files one line at a +time. For each line, @code{awk} tries the patterns of each of the rules. +If several patterns match then several actions are run, in the order in +which they appear in the @code{awk} program. If no patterns match, then +no actions are run. + +After processing all the rules (perhaps none) that match the line, +@code{awk} reads the next line (however, +@pxref{Next Statement, ,The @code{next} Statement}). This continues +until the end of the file is reached.@refill + +For example, the @code{awk} program: + +@example +/12/ @{ print $0 @} +/21/ @{ print $0 @} +@end example + +@noindent +contains two rules. The first rule has the string @samp{12} as the +pattern and @samp{print $0} as the action. The second rule has the +string @samp{21} as the pattern and also has @samp{print $0} as the +action. Each rule's action is enclosed in its own pair of braces. + +This @code{awk} program prints every line that contains the string +@samp{12} @emph{or} the string @samp{21}. If a line contains both +strings, it is printed twice, once by each rule. 
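A small sketch makes the double printing easy to see. The three input lines here are made up for the illustration: one contains only @samp{12}, one only @samp{21}, and one contains both.

```shell
#!/bin/sh
# Made-up input: the last line contains both "21" and "12" (inside "2127").
input='alpha 12
beta 21
gamma 2127'

# The same two rules as above; each rule is tried against every line.
output=$(printf '%s\n' "$input" | awk '/12/ { print $0 }
/21/ { print $0 }')

printf '%s\n' "$output"
```

The line that matches both patterns should appear twice in the output, once for each rule, while the other two lines appear once each.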
+ +If we run this program on our two sample data files, @file{BBS-list} and +@file{inventory-shipped}, as shown here: + +@example +awk '/12/ @{ print $0 @} + /21/ @{ print $0 @}' BBS-list inventory-shipped +@end example + +@noindent +we get the following output: + +@example +aardvark 555-5553 1200/300 B +alpo-net 555-3412 2400/1200/300 A +barfly 555-7685 1200/300 A +bites 555-1675 2400/1200/300 A +core 555-2912 1200/300 C +fooey 555-1234 2400/1200/300 B +foot 555-6699 1200/300 B +macfoo 555-6480 1200/300 A +sdace 555-3430 2400/1200/300 A +sabafoo 555-2127 1200/300 C +sabafoo 555-2127 1200/300 C +Jan 21 36 64 620 +Apr 21 70 74 514 +@end example + +@noindent +Note how the line in @file{BBS-list} beginning with @samp{sabafoo} +was printed twice, once for each rule. + +@node More Complex, Running gawk, Two Rules, Getting Started +@comment node-name, next, previous, up +@section A More Complex Example + +Here is an example to give you an idea of what typical @code{awk} +programs do. This example shows how @code{awk} can be used to +summarize, select, and rearrange the output of another utility. It uses +features that haven't been covered yet, so don't worry if you don't +understand all the details. + +@example +ls -l | awk '$5 == "Nov" @{ sum += $4 @} + END @{ print sum @}' +@end example + +This command prints the total number of bytes in all the files in the +current directory that were last modified in November (of any year). +(In the C shell you would need to type a semicolon and then a backslash +at the end of the first line; in a @sc{posix}-compliant shell, such as the +Bourne shell or the Bourne-Again shell, you can type the example as shown.) + +The @w{@samp{ls -l}} part of this example is a command that gives you a +listing of the files in a directory, including file size and date. 
+Its output looks like this:@refill + +@example +-rw-r--r-- 1 close 1933 Nov 7 13:05 Makefile +-rw-r--r-- 1 close 10809 Nov 7 13:03 gawk.h +-rw-r--r-- 1 close 983 Apr 13 12:14 gawk.tab.h +-rw-r--r-- 1 close 31869 Jun 15 12:20 gawk.y +-rw-r--r-- 1 close 22414 Nov 7 13:03 gawk1.c +-rw-r--r-- 1 close 37455 Nov 7 13:03 gawk2.c +-rw-r--r-- 1 close 27511 Dec 9 13:07 gawk3.c +-rw-r--r-- 1 close 7989 Nov 7 13:03 gawk4.c +@end example + +@noindent +The first field contains read-write permissions, the second field contains +the number of links to the file, and the third field identifies the owner of +the file. The fourth field contains the size of the file in bytes. The +fifth, sixth, and seventh fields contain the month, day, and time, +respectively, that the file was last modified. Finally, the eighth field +contains the name of the file. + +The @code{$5 == "Nov"} in our @code{awk} program is an expression that +tests whether the fifth field of the output from @w{@samp{ls -l}} +matches the string @samp{Nov}. Each time a line has the string +@samp{Nov} in its fifth field, the action @samp{@{ sum += $4 @}} is +performed. This adds the fourth field (the file size) to the variable +@code{sum}. As a result, when @code{awk} has finished reading all the +input lines, @code{sum} is the sum of the sizes of files whose +lines matched the pattern. (This works because @code{awk} variables +are automatically initialized to zero.)@refill + +After the last line of output from @code{ls} has been processed, the +@code{END} rule is executed, and the value of @code{sum} is +printed. In this example, the value of @code{sum} would be 80600.@refill + +These more advanced @code{awk} techniques are covered in later sections +(@pxref{Actions, ,Overview of Actions}). Before you can move on to more +advanced @code{awk} programming, you have to know how @code{awk} interprets +your input and displays your output. 
By manipulating fields and using +@code{print} statements, you can produce some very useful and spectacular +looking reports.@refill + +@node Running gawk, Comments, More Complex, Getting Started +@section How to Run @code{awk} Programs + +@ignore +Date: Mon, 26 Aug 91 09:48:10 +0200 +From: gatech!vsoc07.cern.ch!matheys (Jean-Pol Matheys (CERN - ECP Division)) +To: uunet.UU.NET!skeeve!arnold +Subject: RE: status check + +The introduction of Chapter 2 (i.e. before 2.1) should include +the whole of section 2.4 - it's better to tell people how to run awk programs +before giving any examples + +ADR --- he's right. but for now, don't do this because the rest of the +chapter would need some rewriting. +@end ignore + +@cindex command line formats +@cindex running @code{awk} programs +There are several ways to run an @code{awk} program. If the program is +short, it is easiest to include it in the command that runs @code{awk}, +like this: + +@example +awk '@var{program}' @var{input-file1} @var{input-file2} @dots{} +@end example + +@noindent +where @var{program} consists of a series of patterns and actions, as +described earlier. + +When the program is long, it is usually more convenient to put it in a file +and run it with a command like this: + +@example +awk -f @var{program-file} @var{input-file1} @var{input-file2} @dots{} +@end example + +@menu +* One-shot:: Running a short throw-away @code{awk} program. +* Read Terminal:: Using no input files (input from + terminal instead). +* Long:: Putting permanent @code{awk} programs in files. +* Executable Scripts:: Making self-contained @code{awk} programs. +@end menu + +@node One-shot, Read Terminal, Running gawk, Running gawk +@subsection One-shot Throw-away @code{awk} Programs + +Once you are familiar with @code{awk}, you will often type simple +programs at the moment you want to use them. 
Then you can write the +program as the first argument of the @code{awk} command, like this: + +@example +awk '@var{program}' @var{input-file1} @var{input-file2} @dots{} +@end example + +@noindent +where @var{program} consists of a series of @var{patterns} and +@var{actions}, as described earlier. + +@cindex single quotes, why needed +This command format instructs the shell to start @code{awk} and use the +@var{program} to process records in the input file(s). There are single +quotes around @var{program} so that the shell doesn't interpret any +@code{awk} characters as special shell characters. They also cause the +shell to treat all of @var{program} as a single argument for +@code{awk} and allow @var{program} to be more than one line long.@refill + +This format is also useful for running short or medium-sized @code{awk} +programs from shell scripts, because it avoids the need for a separate +file for the @code{awk} program. A self-contained shell script is more +reliable since there are no other files to misplace. + +@node Read Terminal, Long, One-shot, Running gawk +@subsection Running @code{awk} without Input Files + +@cindex standard input +@cindex input, standard +You can also run @code{awk} without any input files. If you type the +command line:@refill + +@example +awk '@var{program}' +@end example + +@noindent +then @code{awk} applies the @var{program} to the @dfn{standard input}, +which usually means whatever you type on the terminal. This continues +until you indicate end-of-file by typing @kbd{Control-d}. + +For example, if you execute this command: + +@example +awk '/th/' +@end example + +@noindent +whatever you type next is taken as data for that @code{awk} +program. 
If you go on to type the following data: + +@example +Kathy +Ben +Tom +Beth +Seth +Karen +Thomas +@kbd{Control-d} +@end example + +@noindent +then @code{awk} prints this output: + +@example +Kathy +Beth +Seth +@end example + +@noindent +@cindex case sensitivity +@cindex pattern, case sensitive +as matching the pattern @samp{th}. Notice that it did not recognize +@samp{Thomas} as matching the pattern. The @code{awk} language is +@dfn{case sensitive}, and matches patterns exactly. (However, you can +override this with the variable @code{IGNORECASE}. +@xref{Case-sensitivity, ,Case-sensitivity in Matching}.) + +@node Long, Executable Scripts, Read Terminal, Running gawk +@subsection Running Long Programs + +@cindex running long programs +@cindex @samp{-f} option +@cindex program file +@cindex file, @code{awk} program +Sometimes your @code{awk} programs can be very long. In this case it is +more convenient to put the program into a separate file. To tell +@code{awk} to use that file for its program, you type:@refill + +@example +awk -f @var{source-file} @var{input-file1} @var{input-file2} @dots{} +@end example + +The @samp{-f} instructs the @code{awk} utility to get the @code{awk} program +from the file @var{source-file}. Any file name can be used for +@var{source-file}. For example, you could put the program:@refill + +@example +/th/ +@end example + +@noindent +into the file @file{th-prog}. Then this command: + +@example +awk -f th-prog +@end example + +@noindent +does the same thing as this one: + +@example +awk '/th/' +@end example + +@noindent +which was explained earlier (@pxref{Read Terminal, ,Running @code{awk} without Input Files}). +Note that you don't usually need single quotes around the file name that you +specify with @samp{-f}, because most file names don't contain any of the shell's +special characters. Notice that in @file{th-prog}, the @code{awk} +program did not have single quotes around it. 
The quotes are only needed +for programs that are provided on the @code{awk} command line. + +If you want to identify your @code{awk} program files clearly as such, +you can add the extension @file{.awk} to the file name. This doesn't +affect the execution of the @code{awk} program, but it does make +``housekeeping'' easier. + +@node Executable Scripts, , Long, Running gawk +@c node-name, next, previous, up +@subsection Executable @code{awk} Programs +@cindex executable scripts +@cindex scripts, executable +@cindex self contained programs +@cindex program, self contained +@cindex @samp{#!} + +Once you have learned @code{awk}, you may want to write self-contained +@code{awk} scripts, using the @samp{#!} script mechanism. You can do +this on many Unix systems @footnote{The @samp{#!} mechanism works on +Unix systems derived from Berkeley Unix, System V Release 4, and some System +V Release 3 systems.} (and someday on GNU).@refill + +For example, you could create a text file named @file{hello}, containing +the following (where @samp{BEGIN} is a feature we have not yet +discussed): + +@example +#! /bin/awk -f + +# a sample awk program +BEGIN @{ print "hello, world" @} +@end example + +@noindent +After making this file executable (with the @code{chmod} command), you +can simply type: + +@example +hello +@end example + +@noindent +at the shell, and the system will arrange to run @code{awk} @footnote{The +line beginning with @samp{#!} lists the full pathname of an interpreter +to be run, and an optional initial command line argument to pass to that +interpreter. The operating system then runs the interpreter with the given +argument and the full argument list of the executed program. The first argument +in the list is the full pathname of the @code{awk} program. 
The rest of the +argument list will either be options to @code{awk}, or data files, +or both.} as if you had typed:@refill + +@example +awk -f hello +@end example + +@noindent +Self-contained @code{awk} scripts are useful when you want to write a +program which users can invoke without knowing that the program is +written in @code{awk}. + +@cindex shell scripts +@cindex scripts, shell +If your system does not support the @samp{#!} mechanism, you can get a +similar effect using a regular shell script. It would look something +like this: + +@example +: The colon makes sure this script is executed by the Bourne shell. +awk '@var{program}' "$@@" +@end example + +Using this technique, it is @emph{vital} to enclose the @var{program} in +single quotes to protect it from interpretation by the shell. If you +omit the quotes, only a shell wizard can predict the results. + +The @samp{"$@@"} causes the shell to forward all the command line +arguments to the @code{awk} program, without interpretation. The first +line, which starts with a colon, is used so that this shell script will +work even if invoked by a user who uses the C shell. +@c Someday: (See @cite{The Bourne Again Shell}, by ??.) + +@node Comments, Statements/Lines, Running gawk, Getting Started +@section Comments in @code{awk} Programs +@cindex @samp{#} +@cindex comments +@cindex use of comments +@cindex documenting @code{awk} programs +@cindex programs, documenting + +A @dfn{comment} is some text that is included in a program for the sake +of human readers, and that is not really part of the program. Comments +can explain what the program does, and how it works. Nearly all +programming languages have provisions for comments, because programs are +typically hard to understand without their extra help. + +In the @code{awk} language, a comment starts with the sharp sign +character, @samp{#}, and continues to the end of the line. The +@code{awk} language ignores the rest of a line following a sharp sign. 
+For example, we could have put the following into @file{th-prog}:@refill + +@smallexample +# This program finds records containing the pattern @samp{th}. This is how +# you continue comments on additional lines. +/th/ +@end smallexample + +You can also put comment lines into throw-away @code{awk} programs +typed at the keyboard, but this usually isn't very useful; the purpose of a +comment is to help you or another person understand the program at +a later time.@refill + +@node Statements/Lines, When, Comments, Getting Started +@section @code{awk} Statements versus Lines + +Most often, each line in an @code{awk} program is a separate statement or +separate rule, like this: + +@example +awk '/12/ @{ print $0 @} + /21/ @{ print $0 @}' BBS-list inventory-shipped +@end example + +But sometimes statements can be more than one line, and lines can +contain several statements. You can split a statement into multiple +lines by inserting a newline after any of the following:@refill + +@example +, @{ ? : || && do else +@end example + +@noindent +A newline at any other point is considered the end of the statement. +(Splitting lines after @samp{?} and @samp{:} is a minor @code{gawk} +extension. The @samp{?} and @samp{:} referred to here are the operators of the +three-operand conditional expression described in +@ref{Conditional Exp, ,Conditional Expressions}.)@refill + +@cindex backslash continuation +@cindex continuation of lines +If you would like to split a single statement into two lines at a point +where a newline would terminate it, you can @dfn{continue} it by ending the +first line with a backslash character, @samp{\}. This is allowed +absolutely anywhere in the statement, even in the middle of a string or +regular expression. For example: + +@example +awk '/This program is too long, so continue it\ + on the next line/ @{ print $1 @}' +@end example + +@noindent +We have generally not used backslash continuation in the sample programs in +this manual.
Since in @code{gawk} there is no limit on the length of a line, +it is never strictly necessary; it just makes programs prettier. We have +preferred to make them prettier still by keeping the statements short. +Backslash continuation is most useful when your @code{awk} program is in a +separate source file, instead of typed in on the command line. You should +also note that many @code{awk} implementations are pickier about where +you may use backslash continuation. For maximal portability of your @code{awk} +programs, it is best not to split your lines in the middle of a regular +expression or a string.@refill + +@strong{Warning: backslash continuation does not work as described above +with the C shell.} Continuation with backslash works for @code{awk} +programs in files, and also for one-shot programs @emph{provided} you +are using a @sc{posix}-compliant shell, such as the Bourne shell or the +Bourne-again shell. But the C shell used on Berkeley Unix behaves +differently! There, you must use two backslashes in a row, followed by +a newline.@refill + +@cindex multiple statements on one line +When @code{awk} statements within one rule are short, you might want to put +more than one of them on a line. You do this by separating the statements +with a semicolon, @samp{;}. +This also applies to the rules themselves. +Thus, the two-rule program shown at the start of this section could have been written:@refill + +@example +/12/ @{ print $0 @} ; /21/ @{ print $0 @} +@end example + +@noindent +@strong{Note:} the requirement that rules on the same line must be +separated with a semicolon is a recent change in the @code{awk} +language; it was done for consistency with the treatment of statements +within an action. + +@node When, , Statements/Lines, Getting Started +@section When to Use @code{awk} + +@cindex when to use @code{awk} +@cindex applications of @code{awk} +You might wonder how @code{awk} might be useful for you.
Using additional +utility programs, more advanced patterns, field separators, arithmetic +statements, and other selection criteria, you can produce much more +complex output. The @code{awk} language is very useful for producing +reports from large amounts of raw data, such as summarizing information +from the output of other utility programs like @code{ls}. +(@xref{More Complex, ,A More Complex Example}.) + +Programs written with @code{awk} are usually much smaller than they would +be in other languages. This makes @code{awk} programs easy to compose and +use. Often @code{awk} programs can be quickly composed at your terminal, +used once, and thrown away. Since @code{awk} programs are interpreted, you +can avoid the usually lengthy edit-compile-test-debug cycle of software +development. + +Complex programs have been written in @code{awk}, including a complete +retargetable assembler for 8-bit microprocessors (@pxref{Glossary}, for +more information) and a microcode assembler for a special purpose Prolog +computer. However, @code{awk}'s capabilities are strained by tasks of +such complexity. + +If you find yourself writing @code{awk} scripts of more than, say, a few +hundred lines, you might consider using a different programming +language. Emacs Lisp is a good choice if you need sophisticated string +or pattern matching capabilities. The shell is also good at string and +pattern matching; in addition, it allows powerful use of the system +utilities. More conventional languages, such as C, C++, and Lisp, offer +better facilities for system programming and for managing the complexity +of large programs. 
Programs in these languages may require more lines +of source code than the equivalent @code{awk} programs, but they are +easier to maintain and usually run more efficiently.@refill + +@node Reading Files, Printing, Getting Started, Top +@chapter Reading Input Files + +@cindex reading files +@cindex input +@cindex standard input +@vindex FILENAME +In the typical @code{awk} program, all input is read either from the +standard input (by default the keyboard, but often a pipe from another +command) or from files whose names you specify on the @code{awk} command +line. If you specify input files, @code{awk} reads them in order, reading +all the data from one before going on to the next. The name of the current +input file can be found in the built-in variable @code{FILENAME} +(@pxref{Built-in Variables}).@refill + +The input is read in units called records, and processed by the +rules one record at a time. By default, each record is one line. Each +record is split automatically into fields, to make it more +convenient for a rule to work on its parts. + +On rare occasions you will need to use the @code{getline} command, +which can do explicit input from any number of files +(@pxref{Getline, ,Explicit Input with @code{getline}}).@refill + +@menu +* Records:: Controlling how data is split into records. +* Fields:: An introduction to fields. +* Non-Constant Fields:: Non-constant Field Numbers. +* Changing Fields:: Changing the Contents of a Field. +* Field Separators:: The field separator and how to change it. +* Constant Size:: Reading constant width data. +* Multiple Line:: Reading multi-line records. +* Getline:: Reading files under explicit program control + using the @code{getline} function. +* Close Input:: Closing an input file (so you can read from + the beginning once more). 
+@end menu + +@node Records, Fields, Reading Files, Reading Files +@section How Input is Split into Records + +@cindex record separator +The @code{awk} language divides its input into records and fields. +Records are separated by a character called the @dfn{record separator}. +By default, the record separator is the newline character, defining +a record to be a single line of text.@refill + +@iftex +@cindex changing the record separator +@end iftex +@vindex RS +Sometimes you may want to use a different character to separate your +records. You can use a different character by changing the built-in +variable @code{RS}. The value of @code{RS} is a string that says how +to separate records; the default value is @code{"\n"}, the string containing +just a newline character. This is why records are, by default, single lines. + +@code{RS} can have any string as its value, but only the first character +of the string is used as the record separator. The other characters are +ignored. @code{RS} is exceptional in this regard; @code{awk} uses the +full value of all its other built-in variables.@refill + +@ignore +Someday this should be true! + +The value of @code{RS} is not limited to a one-character string. It can +be any regular expression (@pxref{Regexp, ,Regular Expressions as Patterns}). +In general, each record +ends at the next string that matches the regular expression; the next +record starts at the end of the matching string. This general rule is +actually at work in the usual case, where @code{RS} contains just a +newline: a record ends at the beginning of the next matching string (the +next newline in the input) and the following record starts just after +the end of this string (at the first character of the following line). 
+The newline, since it matches @code{RS}, is not part of either record.@refill +@end ignore + +You can change the value of @code{RS} in the @code{awk} program with the +assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}). +The new record-separator character should be enclosed in quotation marks to make +a string constant. Often the right time to do this is at the beginning +of execution, before any input has been processed, so that the very +first record will be read with the proper separator. To do this, use +the special @code{BEGIN} pattern +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}). For +example:@refill + +@example +awk 'BEGIN @{ RS = "/" @} ; @{ print $0 @}' BBS-list +@end example + +@noindent +changes the value of @code{RS} to @code{"/"}, before reading any input. +This is a string whose first character is a slash; as a result, records +are separated by slashes. Then the input file is read, and the second +rule in the @code{awk} program (the action with no pattern) prints each +record. Since each @code{print} statement adds a newline at the end of +its output, the effect of this @code{awk} program is to copy the input +with each slash changed to a newline. + +Another way to change the record separator is on the command line, +using the variable-assignment feature +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@example +awk '@{ print $0 @}' RS="/" BBS-list +@end example + +@noindent +This sets @code{RS} to @samp{/} before processing @file{BBS-list}. + +Reaching the end of an input file terminates the current input record, +even if the last character in the file is not the character in @code{RS}. + +@ignore +@c merge the preceding paragraph and this stuff into one paragraph +@c and put it in an `expert info' section. +This produces correct behavior in the vast majority of cases, although +the following (extreme) pipeline prints a surprising @samp{1}. (There +is one field, consisting of a newline.) 
+ +@example +echo | awk 'BEGIN @{ RS = "a" @} ; @{ print NF @}' +@end example + +@end ignore + +The empty string, @code{""} (a string of no characters), has a special meaning +as the value of @code{RS}: it means that records are separated only +by blank lines. @xref{Multiple Line, ,Multiple-Line Records}, for more details. + +@cindex number of records, @code{NR} or @code{FNR} +@vindex NR +@vindex FNR +The @code{awk} utility keeps track of the number of records that have +been read so far from the current input file. This value is stored in a +built-in variable called @code{FNR}. It is reset to zero when a new +file is started. Another built-in variable, @code{NR}, is the total +number of input records read so far from all files. It starts at zero +but is never automatically reset to zero. + +If you change the value of @code{RS} in the middle of an @code{awk} run, +the new value is used to delimit subsequent records, but the record +currently being processed (and records already processed) are not +affected. + +@node Fields, Non-Constant Fields, Records, Reading Files +@section Examining Fields + +@cindex examining fields +@cindex fields +@cindex accessing fields +When @code{awk} reads an input record, the record is +automatically separated or @dfn{parsed} by the interpreter into chunks +called @dfn{fields}. By default, fields are separated by whitespace, +like words in a line. +Whitespace in @code{awk} means any string of one or more spaces and/or +tabs; other characters such as newline, formfeed, and so on, that are +considered whitespace by other languages are @emph{not} considered +whitespace by @code{awk}.@refill + +The purpose of fields is to make it more convenient for you to refer to +these pieces of the record. You don't have to use them---you can +operate on the whole record if you wish---but fields are what make +simple @code{awk} programs so powerful. 
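As a rough sketch of the default splitting just described, the following pipeline feeds @code{awk} one record whose words are separated by a run of spaces and by a tab, then prints the first three fields on separate lines. The sample text and the shell variable name @code{fields_out} are made up for illustration, and a @sc{posix}-style shell with @code{printf} is assumed.

```shell
# One input record: three spaces after "This", a tab after "seems".
# With the default field separator, each run of spaces and/or tabs
# counts as a single separator.
fields_out=$(printf 'This   seems\tlike a pretty nice example.\n' |
    awk '{ print $1; print $2; print $3 }')

# Show the three fields, one per line.
printf '%s\n' "$fields_out"
```

If the splitting works as described, the three lines printed are @samp{This}, @samp{seems} and @samp{like}; neither the extra spaces nor the tab produces an empty field.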
+ +@cindex @code{$} (field operator) +@cindex operators, @code{$} +To refer to a field in an @code{awk} program, you use a dollar-sign, +@samp{$}, followed by the number of the field you want. Thus, @code{$1} +refers to the first field, @code{$2} to the second, and so on. For +example, suppose the following is a line of input:@refill + +@example +This seems like a pretty nice example. +@end example + +@noindent +Here the first field, or @code{$1}, is @samp{This}; the second field, or +@code{$2}, is @samp{seems}; and so on. Note that the last field, +@code{$7}, is @samp{example.}. Because there is no space between the +@samp{e} and the @samp{.}, the period is considered part of the seventh +field.@refill + +No matter how many fields there are, the last field in a record can be +represented by @code{$NF}. So, in the example above, @code{$NF} would +be the same as @code{$7}, which is @samp{example.}. Why this works is +explained below (@pxref{Non-Constant Fields, ,Non-constant Field Numbers}). +If you try to refer to a field beyond the last one, such as @code{$8} +when the record has only 7 fields, you get the empty string.@refill + +@vindex NF +@cindex number of fields, @code{NF} +Plain @code{NF}, with no @samp{$}, is a built-in variable whose value +is the number of fields in the current record. + +@code{$0}, which looks like an attempt to refer to the zeroth field, is +a special case: it represents the whole input record. This is what you +would use if you weren't interested in fields. + +Here are some more examples: + +@example +awk '$1 ~ /foo/ @{ print $0 @}' BBS-list +@end example + +@noindent +This example prints each record in the file @file{BBS-list} whose first +field contains the string @samp{foo}. 
The operator @samp{~} is called a +@dfn{matching operator} (@pxref{Comparison Ops, ,Comparison Expressions}); +it tests whether a string (here, the field @code{$1}) matches a given regular +expression.@refill + +By contrast, the following example: + +@example +awk '/foo/ @{ print $1, $NF @}' BBS-list +@end example + +@noindent +looks for @samp{foo} in @emph{the entire record} and prints the first +field and the last field for each input record containing a +match.@refill + +@node Non-Constant Fields, Changing Fields, Fields, Reading Files +@section Non-constant Field Numbers + +The number of a field does not need to be a constant. Any expression in +the @code{awk} language can be used after a @samp{$} to refer to a +field. The value of the expression specifies the field number. If the +value is a string, rather than a number, it is converted to a number. +Consider this example:@refill + +@example +awk '@{ print $NR @}' +@end example + +@noindent +Recall that @code{NR} is the number of records read so far: 1 in the +first record, 2 in the second, etc. So this example prints the first +field of the first record, the second field of the second record, and so +on. For the twentieth record, field number 20 is printed; most likely, +the record has fewer than 20 fields, so this prints a blank line. + +Here is another example of using expressions as field numbers: + +@example +awk '@{ print $(2*2) @}' BBS-list +@end example + +The @code{awk} language must evaluate the expression @code{(2*2)} and use +its value as the number of the field to print. The @samp{*} sign +represents multiplication, so the expression @code{2*2} evaluates to 4. +The parentheses are used so that the multiplication is done before the +@samp{$} operation; they are necessary whenever there is a binary +operator in the field-number expression. 
This example, then, prints the +hours of operation (the fourth field) for every line of the file +@file{BBS-list}.@refill + +If the field number you compute is zero, you get the entire record. +Thus, @code{$(2-2)} has the same value as @code{$0}. Negative field +numbers are not allowed. + +The number of fields in the current record is stored in the built-in +variable @code{NF} (@pxref{Built-in Variables}). The expression +@code{$NF} is not a special feature: it is the direct consequence of +evaluating @code{NF} and using its value as a field number. + +@node Changing Fields, Field Separators, Non-Constant Fields, Reading Files +@section Changing the Contents of a Field + +@cindex field, changing contents of +@cindex changing contents of a field +@cindex assignment to fields +You can change the contents of a field as seen by @code{awk} within an +@code{awk} program; this changes what @code{awk} perceives as the +current input record. (The actual input is untouched: @code{awk} never +modifies the input file.) + +Consider this example: + +@smallexample +awk '@{ $3 = $2 - 10; print $2, $3 @}' inventory-shipped +@end smallexample + +@noindent +The @samp{-} sign represents subtraction, so this program reassigns +field three, @code{$3}, to be the value of field two minus ten, +@code{$2 - 10}. (@xref{Arithmetic Ops, ,Arithmetic Operators}.) +Then field two, and the new value for field three, are printed. + +In order for this to work, the text in field @code{$2} must make sense +as a number; the string of characters must be converted to a number in +order for the computer to do arithmetic on it. The number resulting +from the subtraction is converted back to a string of characters which +then becomes field three. +@xref{Conversion, ,Conversion of Strings and Numbers}.@refill + +When you change the value of a field (as perceived by @code{awk}), the +text of the input record is recalculated to contain the new field where +the old one was. 
Therefore, @code{$0} changes to reflect the altered +field. Thus, + +@smallexample +awk '@{ $2 = $2 - 10; print $0 @}' inventory-shipped +@end smallexample + +@noindent +prints a copy of the input file, with 10 subtracted from the second +field of each line. + +You can also assign contents to fields that are out of range. For +example: + +@smallexample +awk '@{ $6 = ($5 + $4 + $3 + $2) ; print $6 @}' inventory-shipped +@end smallexample + +@noindent +We've just created @code{$6}, whose value is the sum of fields +@code{$2}, @code{$3}, @code{$4}, and @code{$5}. The @samp{+} sign +represents addition. For the file @file{inventory-shipped}, @code{$6} +represents the total number of parcels shipped for a particular month. + +Creating a new field changes the internal @code{awk} copy of the current +input record---the value of @code{$0}. Thus, if you do @samp{print $0} +after adding a field, the record printed includes the new field, with +the appropriate number of field separators between it and the previously +existing fields. + +This recomputation affects and is affected by several features not yet +discussed, in particular, the @dfn{output field separator}, @code{OFS}, +which is used to separate the fields (@pxref{Output Separators}), and +@code{NF} (the number of fields; @pxref{Fields, ,Examining Fields}). +For example, the value of @code{NF} is set to the number of the highest +field you create.@refill + +Note, however, that merely @emph{referencing} an out-of-range field +does @emph{not} change the value of either @code{$0} or @code{NF}. +Referencing an out-of-range field merely produces a null string. For +example:@refill + +@smallexample +if ($(NF+1) != "") + print "can't happen" +else + print "everything is normal" +@end smallexample + +@noindent +should print @samp{everything is normal}, because @code{NF+1} is certain +to be out of range. 
(@xref{If Statement, ,The @code{if} Statement}, +for more information about @code{awk}'s @code{if-else} statements.)@refill + +It is important to note that assigning to an existing field will change the +value of @code{$0}, but will not change the value of @code{NF}, +even when you assign the null string to a field. For example: + +@smallexample +echo a b c d | awk '@{ OFS = ":"; $2 = "" ; print ; print NF @}' +@end smallexample + +@noindent +prints + +@smallexample +a::c:d +4 +@end smallexample + +@noindent +The field is still there; it just has an empty value. You can tell +because there are two colons in a row. + +@node Field Separators, Constant Size, Changing Fields, Reading Files +@section Specifying how Fields are Separated +@vindex FS +@cindex fields, separating +@cindex field separator, @code{FS} +@cindex @samp{-F} option + +(This section is rather long; it describes one of the most fundamental +operations in @code{awk}. If you are a novice with @code{awk}, we +recommend that you re-read this section after you have studied the +section on regular expressions, @ref{Regexp, ,Regular Expressions as Patterns}.) + +The way @code{awk} splits an input record into fields is controlled by +the @dfn{field separator}, which is a single character or a regular +expression. @code{awk} scans the input record for matches for the +separator; the fields themselves are the text between the matches. For +example, if the field separator is @samp{oo}, then the following line: + +@smallexample +moo goo gai pan +@end smallexample + +@noindent +would be split into three fields: @samp{m}, @samp{@ g} and @samp{@ gai@ +pan}. + +The field separator is represented by the built-in variable @code{FS}. +Shell programmers take note! @code{awk} does not use the name @code{IFS} +which is used by the shell.@refill + +You can change the value of @code{FS} in the @code{awk} program with the +assignment operator, @samp{=} (@pxref{Assignment Ops, ,Assignment Expressions}).
+Often the right time to do this is at the beginning of execution, +before any input has been processed, so that the very first record +will be read with the proper separator. To do this, use the special +@code{BEGIN} pattern +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}). +For example, here we set the value of @code{FS} to the string +@code{","}:@refill + +@smallexample +awk 'BEGIN @{ FS = "," @} ; @{ print $2 @}' +@end smallexample + +@noindent +Given the input line, + +@smallexample +John Q. Smith, 29 Oak St., Walamazoo, MI 42139 +@end smallexample + +@noindent +this @code{awk} program extracts the string @samp{@ 29 Oak St.}. + +@cindex field separator, choice of +@cindex regular expressions as field separators +Sometimes your input data will contain separator characters that don't +separate fields the way you thought they would. For instance, the +person's name in the example we've been using might have a title or +suffix attached, such as @samp{John Q. Smith, LXIX}. From input +containing such a name: + +@smallexample +John Q. Smith, LXIX, 29 Oak St., Walamazoo, MI 42139 +@end smallexample + +@noindent +the previous sample program would extract @samp{@ LXIX}, instead of +@samp{@ 29 Oak St.}. If you were expecting the program to print the +address, you would be surprised. So choose your data layout and +separator characters carefully to prevent such problems. + +As you know, by default, fields are separated by whitespace sequences +(spaces and tabs), not by single spaces: two spaces in a row do not +delimit an empty field. The default value of the field separator is a +string @w{@code{" "}} containing a single space. If this value were +interpreted in the usual way, each space character would separate +fields, so two spaces in a row would make an empty field between them. +The reason this does not happen is that a single space as the value of +@code{FS} is a special case: it is taken to specify the default manner +of delimiting fields. 
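The special treatment of the default separator can be seen by counting fields. The sketch below is only an illustration: the two input lines and the shell variable names @code{default_nf} and @code{comma_nf} are invented here, and a @sc{posix}-style shell is assumed. The first command leaves @code{FS} at its default and reads a line with two spaces in a row; the second sets @code{FS} to @code{","} and reads a line with two commas in a row.

```shell
# Default FS (a single space): two spaces in a row are one separator,
# so the record "a  b" has two fields.
default_nf=$(printf 'a  b\n' | awk '{ print NF }')

# FS set to a comma: each comma separates fields, so "a,,b" has an
# empty field between the two commas, three fields in all.
comma_nf=$(printf 'a,,b\n' | awk 'BEGIN { FS = "," } ; { print NF }')

echo "$default_nf $comma_nf"
```

The output should be @samp{2 3}: the doubled space does not delimit an empty field, while the doubled comma does.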
+ +If @code{FS} is any other single character, such as @code{","}, then +each occurrence of that character separates two fields. Two consecutive +occurrences delimit an empty field. If the character occurs at the +beginning or the end of the line, that too delimits an empty field. The +space character is the only single character which does not follow these +rules. + +More generally, the value of @code{FS} may be a string containing any +regular expression. Then each match in the record for the regular +expression separates fields. For example, the assignment:@refill + +@smallexample +FS = ", \t" +@end smallexample + +@noindent +makes every area of an input line that consists of a comma followed by a +space and a tab, into a field separator. (@samp{\t} stands for a +tab.)@refill + +For a less trivial example of a regular expression, suppose you want +single spaces to separate fields the way single commas were used above. +You can set @code{FS} to @w{@code{"[@ ]"}}. This regular expression +matches a single space and nothing else. + +@c the following index entry is an overfull hbox. --mew 30jan1992 +@cindex field separator: on command line +@cindex command line, setting @code{FS} on +@code{FS} can be set on the command line. You use the @samp{-F} argument to +do so. For example: + +@smallexample +awk -F, '@var{program}' @var{input-files} +@end smallexample + +@noindent +sets @code{FS} to be the @samp{,} character. Notice that the argument uses +a capital @samp{F}. Contrast this with @samp{-f}, which specifies a file +containing an @code{awk} program. Case is significant in command options: +the @samp{-F} and @samp{-f} options have nothing to do with each other. +You can use both options at the same time to set the @code{FS} argument +@emph{and} get an @code{awk} program from a file.@refill + +@c begin expert info +The value used for the argument to @samp{-F} is processed in exactly the +same way as assignments to the built-in variable @code{FS}. 
This means that +if the field separator contains special characters, they must be escaped +appropriately. For example, to use a @samp{\} as the field separator, you +would have to type: + +@smallexample +# same as FS = "\\" +awk -F\\\\ '@dots{}' files @dots{} +@end smallexample + +@noindent +Since @samp{\} is used for quoting in the shell, @code{awk} will see +@samp{-F\\}. Then @code{awk} processes the @samp{\\} for escape +characters (@pxref{Constants, ,Constant Expressions}), finally yielding +a single @samp{\} to be used for the field separator. +@c end expert info + +As a special case, in compatibility mode +(@pxref{Command Line, ,Invoking @code{awk}}), if the +argument to @samp{-F} is @samp{t}, then @code{FS} is set to the tab +character. (This is because if you type @samp{-F\t}, without the quotes, +at the shell, the @samp{\} gets deleted, so @code{awk} figures that you +really want your fields to be separated with tabs, and not @samp{t}s. +Use @samp{-v FS="t"} on the command line if you really do want to separate +your fields with @samp{t}s.)@refill + +For example, let's use an @code{awk} program file called @file{baud.awk} +that contains the pattern @code{/300/}, and the action @samp{print $1}. +Here is the program: + +@smallexample +/300/ @{ print $1 @} +@end smallexample + +Let's also set @code{FS} to be the @samp{-} character, and run the +program on the file @file{BBS-list}. The following command prints a +list of the names of the bulletin boards that operate at 300 baud and +the first three digits of their phone numbers:@refill + +@smallexample +awk -F- -f baud.awk BBS-list +@end smallexample + +@noindent +It produces this output: + +@smallexample +aardvark 555 +alpo +barfly 555 +bites 555 +camelot 555 +core 555 +fooey 555 +foot 555 +macfoo 555 +sdace 555 +sabafoo 555 +@end smallexample + +@noindent +Note the second line of output. 
If you check the original file, you will +see that the second line looked like this: + +@smallexample +alpo-net 555-3412 2400/1200/300 A +@end smallexample + +The @samp{-} as part of the system's name was used as the field +separator, instead of the @samp{-} in the phone number that was +originally intended. This demonstrates why you have to be careful in +choosing your field and record separators. + +The following program searches the system password file, and prints +the entries for users who have no password: + +@smallexample +awk -F: '$2 == ""' /etc/passwd +@end smallexample + +@noindent +Here we use the @samp{-F} option on the command line to set the field +separator. Note that fields in @file{/etc/passwd} are separated by +colons. The second field represents a user's encrypted password, but if +the field is empty, that user has no password. + +@c begin expert info +According to the @sc{posix} standard, @code{awk} is supposed to behave +as if each record is split into fields at the time that it is read. +In particular, this means that you can change the value of @code{FS} +after a record is read, but before any of the fields are referenced. +The value of the fields (i.e. how they were split) should reflect the +old value of @code{FS}, not the new one. + +However, many implementations of @code{awk} do not do this. Instead, +they defer splitting the fields until a field reference actually happens, +using the @emph{current} value of @code{FS}! This behavior can be difficult +to diagnose. The following example illustrates the results of the two methods. +(The @code{sed} command prints just the first line of @file{/etc/passwd}.) 
+ +@smallexample +sed 1q /etc/passwd | awk '@{ FS = ":" ; print $1 @}' +@end smallexample + +@noindent +will usually print + +@smallexample +root +@end smallexample + +@noindent +on an incorrect implementation of @code{awk}, while @code{gawk} +will print something like + +@smallexample +root:nSijPlPhZZwgE:0:0:Root:/: +@end smallexample +@c end expert info + +@c begin expert info +There is an important difference between the two cases of @samp{FS = @w{" "}} +(a single blank) and @samp{FS = @w{"[ \t]+"}} (which is a regular expression +matching one or more blanks or tabs). For both values of @code{FS}, fields +are separated by runs of blanks and/or tabs. However, when the value of +@code{FS} is @code{" "}, @code{awk} will strip leading and trailing whitespace +from the record, and then decide where the fields are. + +For example, the following expression prints @samp{b}: + +@smallexample +echo ' a b c d ' | awk '@{ print $2 @}' +@end smallexample + +@noindent +However, the following prints @samp{a}: + +@smallexample +echo ' a b c d ' | awk 'BEGIN @{ FS = "[ \t]+" @} ; @{ print $2 @}' +@end smallexample + +@noindent +In this case, the first field is null. + +The stripping of leading and trailing whitespace also comes into +play whenever @code{$0} is recomputed. For instance, this pipeline + +@smallexample +echo ' a b c d' | awk '@{ print; $2 = $2; print @}' +@end smallexample + +@noindent +produces this output: + +@smallexample + a b c d +a b c d +@end smallexample + +@noindent +The first @code{print} statement prints the record as it was read, +with leading whitespace intact. The assignment to @code{$2} rebuilds +@code{$0} by concatenating @code{$1} through @code{$NF} together, +separated by the value of @code{OFS}. Since the leading whitespace +was ignored when finding @code{$1}, it is not part of the new @code{$0}. +Finally, the last @code{print} statement prints the new @code{$0}. 
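
The role of @code{OFS} in rebuilding @code{$0} is easier to see when it is not a space. The following is a minimal sketch, assuming a POSIX-compatible @code{awk}; the helper name @code{rebuild} is illustrative only.

```shell
# rebuild is an illustrative helper name, not part of awk.
# It forces $0 to be recomputed by assigning $2 to itself, with
# OFS set to "-" so that the rebuilt separators are visible.
rebuild() {
    printf '%s\n' "$1" | awk 'BEGIN { OFS = "-" } { $2 = $2; print }'
}

rebuild '  a   b c d'
```

This prints @samp{a-b-c-d}: the leading whitespace is gone, and every run of blanks between fields has been replaced by a single copy of @code{OFS}.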
+@c end expert info + +The following table summarizes how fields are split, based on the +value of @code{FS}. + +@table @code +@item FS == " " +Fields are separated by runs of whitespace. Leading and trailing +whitespace are ignored. This is the default. + +@item FS == @var{any single character} +Fields are separated by each occurrence of the character. Multiple +successive occurrences delimit empty fields, as do leading and +trailing occurrences. + +@item FS == @var{regexp} +Fields are separated by occurrences of characters that match @var{regexp}. +Leading and trailing matches of @var{regexp} delimit empty fields. +@end table + +@node Constant Size, Multiple Line, Field Separators, Reading Files +@section Reading Fixed-width Data + +(This section discusses an advanced, experimental feature. If you are +a novice @code{awk} user, you may wish to skip it on the first reading.) + +@code{gawk} 2.13 introduced a new facility for dealing with fixed-width fields +with no distinctive field separator. Data of this nature arises typically +in one of at least two ways: the input for old FORTRAN programs where +numbers are run together, and the output of programs that did not anticipate +the use of their output as input for other programs. + +An example of the latter is a table where all the columns are lined up by +the use of a variable number of spaces and @emph{empty fields are just +spaces}. Clearly, @code{awk}'s normal field splitting based on @code{FS} +will not work well in this case. (Although a portable @code{awk} program +can use a series of @code{substr} calls on @code{$0}, this is awkward and +inefficient for a large number of fields.)@refill + +The splitting of an input record into fixed-width fields is specified by +assigning a string containing space-separated numbers to the built-in +variable @code{FIELDWIDTHS}. Each number specifies the width of the field +@emph{including} columns between fields. 
If you want to ignore the columns +between fields, you can specify the width as a separate field that is +subsequently ignored. + +The following data is the output of the @code{w} utility. It is useful +to illustrate the use of @code{FIELDWIDTHS}. + +@smallexample + 10:06pm up 21 days, 14:04, 23 users +User tty login@ idle JCPU PCPU what +hzuo ttyV0 8:58pm 9 5 vi p24.tex +hzang ttyV3 6:37pm 50 -csh +eklye ttyV5 9:53pm 7 1 em thes.tex +dportein ttyV6 8:17pm 1:47 -csh +gierd ttyD3 10:00pm 1 elm +dave ttyD4 9:47pm 4 4 w +brent ttyp0 26Jun91 4:46 26:46 4:41 bash +dave ttyq4 26Jun9115days 46 46 wnewmail +@end smallexample + +The following program takes the above input, converts the idle time to +number of seconds and prints out the first two fields and the calculated +idle time. (This program uses a number of @code{awk} features that +haven't been introduced yet.)@refill + +@smallexample +BEGIN @{ FIELDWIDTHS = "9 6 10 6 7 7 35" @} +NR > 2 @{ + idle = $4 + sub(/^ */, "", idle) # strip leading spaces + if (idle == "") idle = 0 + if (idle ~ /:/) @{ split(idle, t, ":"); idle = t[1] * 60 + t[2] @} + if (idle ~ /days/) @{ idle *= 24 * 60 * 60 @} + + print $1, $2, idle +@} +@end smallexample + +Here is the result of running the program on the data: + +@smallexample +hzuo ttyV0 0 +hzang ttyV3 50 +eklye ttyV5 0 +dportein ttyV6 107 +gierd ttyD3 1 +dave ttyD4 0 +brent ttyp0 286 +dave ttyq4 1296000 +@end smallexample + +Another (possibly more practical) example of fixed-width input data +would be the input from a deck of balloting cards. In some parts of +the United States, voters make their choices by punching holes in computer +cards. These cards are then processed to count the votes for any particular +candidate or on any particular issue. Since a voter may choose not to +vote on some issue, any column on the card may be empty. 
An @code{awk} +program for processing such data could use the @code{FIELDWIDTHS} feature +to simplify reading the data.@refill + +@c of course, getting gawk to run on a system with card readers is +@c another story! + +This feature is still experimental, and will likely evolve over time. + +@node Multiple Line, Getline, Constant Size, Reading Files +@section Multiple-Line Records + +@cindex multiple line records +@cindex input, multiple line records +@cindex reading files, multiple line records +@cindex records, multiple line +In some data bases, a single line cannot conveniently hold all the +information in one entry. In such cases, you can use multi-line +records. + +The first step in doing this is to choose your data format: when records +are not defined as single lines, how do you want to define them? +What should separate records? + +One technique is to use an unusual character or string to separate +records. For example, you could use the formfeed character (written +@code{\f} in @code{awk}, as in C) to separate them, making each record +a page of the file. To do this, just set the variable @code{RS} to +@code{"\f"} (a string containing the formfeed character). Any +other character could equally well be used, as long as it won't be part +of the data in a record.@refill + +@ignore +Another technique is to have blank lines separate records. The string +@code{"^\n+"} is a regular expression that matches any sequence of +newlines starting at the beginning of a line---in other words, it +matches a sequence of blank lines. If you set @code{RS} to this string, +a record always ends at the first blank line encountered. In +addition, a regular expression always matches the longest possible +sequence when there is a choice. So the next record doesn't start until +the first nonblank line that follows---no matter how many blank lines +appear in a row, they are considered one record-separator. +@end ignore + +Another technique is to have blank lines separate records. 
By a special +dispensation, a null string as the value of @code{RS} indicates that +records are separated by one or more blank lines. If you set @code{RS} +to the null string, a record always ends at the first blank line +encountered. And the next record doesn't start until the first nonblank +line that follows---no matter how many blank lines appear in a row, they +are considered one record-separator. (End of file is also considered +a record separator.)@refill +@c !!! This use of `end of file' is confusing. Needs to be clarified. + +The second step is to separate the fields in the record. One way to do +this is to put each field on a separate line: to do this, just set the +variable @code{FS} to the string @code{"\n"}. (This simple regular +expression matches a single newline.) + +Another way to separate fields is to divide each of the lines into fields +in the normal manner. This happens by default as a result of a special +feature: when @code{RS} is set to the null string, the newline character +@emph{always} acts as a field separator. This is in addition to whatever +field separations result from @code{FS}. + +The original motivation for this special exception was probably so that +you get useful behavior in the default case (i.e., @w{@code{FS == " "}}). +This feature can be a problem if you really don't want the +newline character to separate fields, since there is no way to +prevent it. 
However, you can work around this by using the @code{split} +function to break up the record manually +(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@refill + +@ignore +Here are two ways to use records separated by blank lines and break each +line into fields normally: + +@example +awk 'BEGIN @{ RS = ""; FS = "[ \t\n]+" @} @{ print $1 @}' BBS-list + +@exdent @r{or} + +awk 'BEGIN @{ RS = "^\n+"; FS = "[ \t\n]+" @} @{ print $1 @}' BBS-list +@end example +@end ignore + +@ignore +Here is how to use records separated by blank lines and break each +line into fields normally: + +@example +awk 'BEGIN @{ RS = ""; FS = "[ \t\n]+" @} ; @{ print $1 @}' BBS-list +@end example +@end ignore + +@node Getline, Close Input, Multiple Line, Reading Files +@section Explicit Input with @code{getline} + +@findex getline +@cindex input, explicit +@cindex explicit input +@cindex input, @code{getline} command +@cindex reading files, @code{getline} command +So far we have been getting our input files from @code{awk}'s main +input stream---either the standard input (usually your terminal) or the +files specified on the command line. The @code{awk} language has a +special built-in command called @code{getline} that +can be used to read input under your explicit control.@refill + +This command is quite complex and should @emph{not} be used by +beginners. It is covered here because this is the chapter on input. +The examples that follow the explanation of the @code{getline} command +include material that has not been covered yet. Therefore, come back +and study the @code{getline} command @emph{after} you have reviewed the +rest of this manual and have a good knowledge of how @code{awk} works. + +@vindex ERRNO +@cindex differences: @code{gawk} and @code{awk} +@code{getline} returns 1 if it finds a record, and 0 if the end of the +file is encountered. If there is some error in getting a record, such +as a file that cannot be opened, then @code{getline} returns @minus{}1. 
+In this case, @code{gawk} sets the variable @code{ERRNO} to a string +describing the error that occurred. + +In the following examples, @var{command} stands for a string value that +represents a shell command. + +@table @code +@item getline +The @code{getline} command can be used without arguments to read input +from the current input file. All it does in this case is read the next +input record and split it up into fields. This is useful if you've +finished processing the current record, but you want to do some special +processing @emph{right now} on the next record. Here's an +example:@refill + +@example +awk '@{ + if (t = index($0, "/*")) @{ + if (t > 1) + tmp = substr($0, 1, t - 1) + else + tmp = "" + u = index(substr($0, t + 2), "*/") + while (u == 0) @{ + getline + t = -1 + u = index($0, "*/") + @} + if (u <= length($0) - 2) + $0 = tmp substr($0, t + u + 3) + else + $0 = tmp + @} + print $0 +@}' +@end example + +This @code{awk} program deletes all C-style comments, @samp{/* @dots{} +*/}, from the input. By replacing the @samp{print $0} with other +statements, you could perform more complicated processing on the +decommented input, like searching for matches of a regular +expression. (This program has a subtle problem---can you spot it?) + +@c the program to remove comments doesn't work if one +@c comment ends and another begins on the same line. (Your +@c idea for restart would be useful here). --- brennan@boeing.com + +This form of the @code{getline} command sets @code{NF} (the number of +fields; @pxref{Fields, ,Examining Fields}), @code{NR} (the number of +records read so far; @pxref{Records, ,How Input is Split into Records}), +@code{FNR} (the number of records read from this input file), and the +value of @code{$0}. + +@strong{Note:} the new value of @code{$0} is used in testing +the patterns of any subsequent rules. The original value +of @code{$0} that triggered the rule which executed @code{getline} +is lost. 
By contrast, the @code{next} statement reads a new record +but immediately begins processing it normally, starting with the first +rule in the program. @xref{Next Statement, ,The @code{next} Statement}. + +@item getline @var{var} +This form of @code{getline} reads a record into the variable @var{var}. +This is useful when you want your program to read the next record from +the current input file, but you don't want to subject the record to the +normal input processing. + +For example, suppose the next line is a comment, or a special string, +and you want to read it, but you must make certain that it won't trigger +any rules. This version of @code{getline} allows you to read that line +and store it in a variable so that the main +read-a-line-and-check-each-rule loop of @code{awk} never sees it. + +The following example swaps every two lines of input. For example, given: + +@example +wan +tew +free +phore +@end example + +@noindent +it outputs: + +@example +tew +wan +phore +free +@end example + +@noindent +Here's the program: + +@example +@group +awk '@{ + if ((getline tmp) > 0) @{ + print tmp + print $0 + @} else + print $0 +@}' +@end group +@end example + +The @code{getline} function used in this way sets only the variables +@code{NR} and @code{FNR} (and of course, @var{var}). The record is not +split into fields, so the values of the fields (including @code{$0}) and +the value of @code{NF} do not change.@refill + +@item getline < @var{file} +@cindex input redirection +@cindex redirection of input +This form of the @code{getline} function takes its input from the file +@var{file}. Here @var{file} is a string-valued expression that +specifies the file name. @samp{< @var{file}} is called a @dfn{redirection} +since it directs input to come from a different place. + +This form is useful if you want to read your input from a particular +file, instead of from the main input stream. 
For example, the following +program reads its input record from the file @file{foo.input} when it +encounters a first field with a value equal to 10 in the current input +file.@refill + +@example +awk '@{ + if ($1 == 10) @{ + getline < "foo.input" + print + @} else + print +@}' +@end example + +Since the main input stream is not used, the values of @code{NR} and +@code{FNR} are not changed. But the record read is split into fields in +the normal manner, so the values of @code{$0} and other fields are +changed. So is the value of @code{NF}. + +This does not cause the record to be tested against all the patterns +in the @code{awk} program, in the way that would happen if the record +were read normally by the main processing loop of @code{awk}. However +the new record is tested against any subsequent rules, just as when +@code{getline} is used without a redirection. + +@item getline @var{var} < @var{file} +This form of the @code{getline} function takes its input from the file +@var{file} and puts it in the variable @var{var}. As above, @var{file} +is a string-valued expression that specifies the file from which to read. + +In this version of @code{getline}, none of the built-in variables are +changed, and the record is not split into fields. The only variable +changed is @var{var}. + +For example, the following program copies all the input files to the +output, except for records that say @w{@samp{@@include @var{filename}}}. 
+Such a record is replaced by the contents of the file +@var{filename}.@refill + +@example +awk '@{ + if (NF == 2 && $1 == "@@include") @{ + while ((getline line < $2) > 0) + print line + close($2) + @} else + print +@}' +@end example + +Note here how the name of the extra input file is not built into +the program; it is taken from the data, from the second field on +the @samp{@@include} line.@refill + +The @code{close} function is called to ensure that if two identical +@samp{@@include} lines appear in the input, the entire specified file is +included twice. @xref{Close Input, ,Closing Input Files and Pipes}.@refill + +One deficiency of this program is that it does not process nested +@samp{@@include} statements the way a true macro preprocessor would. + +@item @var{command} | getline +You can @dfn{pipe} the output of a command into @code{getline}. A pipe is +simply a way to link the output of one program to the input of another. In +this case, the string @var{command} is run as a shell command and its output +is piped into @code{awk} to be used as input. This form of @code{getline} +reads one record from the pipe. + +For example, the following program copies input to output, except for lines +that begin with @samp{@@execute}, which are replaced by the output produced by +running the rest of the line as a shell command: + +@example +awk '@{ + if ($1 == "@@execute") @{ + tmp = substr($0, 10) + while ((tmp | getline) > 0) + print + close(tmp) + @} else + print +@}' +@end example + +@noindent +The @code{close} function is called to ensure that if two identical +@samp{@@execute} lines appear in the input, the command is run for +each one. @xref{Close Input, ,Closing Input Files and Pipes}. 
+ +Given the input: + +@example +foo +bar +baz +@@execute who +bletch +@end example + +@noindent +the program might produce: + +@example +foo +bar +baz +hack ttyv0 Jul 13 14:22 +hack ttyp0 Jul 13 14:23 (gnu:0) +hack ttyp1 Jul 13 14:23 (gnu:0) +hack ttyp2 Jul 13 14:23 (gnu:0) +hack ttyp3 Jul 13 14:23 (gnu:0) +bletch +@end example + +@noindent +Notice that this program ran the command @code{who} and printed the result. +(If you try this program yourself, you will get different results, showing +you who is logged in on your system.) + +This variation of @code{getline} splits the record into fields, sets the +value of @code{NF} and recomputes the value of @code{$0}. The values of +@code{NR} and @code{FNR} are not changed. + +@item @var{command} | getline @var{var} +The output of the command @var{command} is sent through a pipe to +@code{getline} and into the variable @var{var}. For example, the +following program reads the current date and time into the variable +@code{current_time}, using the @code{date} utility, and then +prints it.@refill + +@example +awk 'BEGIN @{ + "date" | getline current_time + close("date") + print "Report printed on " current_time +@}' +@end example + +In this version of @code{getline}, none of the built-in variables are +changed, and the record is not split into fields. +@end table + +@node Close Input, , Getline, Reading Files +@section Closing Input Files and Pipes +@cindex closing input files and pipes +@findex close + +If the same file name or the same shell command is used with +@code{getline} more than once during the execution of an @code{awk} +program, the file is opened (or the command is executed) only the first time. +At that time, the first record of input is read from that file or command. +The next time the same file or command is used in @code{getline}, another +record is read from it, and so on. 
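
The following is a minimal sketch of this behavior, assuming a POSIX-compatible @code{awk} and the @code{mktemp} utility. The helper name @code{read_twice} and the variable @code{f} are illustrative only.

```shell
# read_twice is an illustrative helper name, not part of awk.
# It writes a two-line scratch file, then reads from it twice with
# getline, using the same file name both times.
read_twice() {
    tmp=$(mktemp) || return 1
    printf 'one\ntwo\n' > "$tmp"
    awk -v f="$tmp" 'BEGIN {
        getline a < f    # first use: the file is opened, first record read
        getline b < f    # same name again: the next record is read
        print a, b
    }'
    rm -f "$tmp"
}

read_twice
```

This prints @samp{one two}. The second @code{getline} does not start over at the top of the file; it continues from where the first one stopped.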
+ +This implies that if you want to start reading the same file again from +the beginning, or if you want to rerun a shell command (rather than +reading more output from the command), you must take special steps. +What you must do is use the @code{close} function, as follows: + +@example +close(@var{filename}) +@end example + +@noindent +or + +@example +close(@var{command}) +@end example + +The argument @var{filename} or @var{command} can be any expression. Its +value must exactly equal the string that was used to open the file or +start the command---for example, if you open a pipe with this: + +@example +"sort -r names" | getline foo +@end example + +@noindent +then you must close it with this: + +@example +close("sort -r names") +@end example + +Once this function call is executed, the next @code{getline} from that +file or command will reopen the file or rerun the command. + +@iftex +@vindex ERRNO +@cindex differences: @code{gawk} and @code{awk} +@end iftex +@code{close} returns a value of zero if the close succeeded. +Otherwise, the value will be non-zero. +In this case, @code{gawk} sets the variable @code{ERRNO} to a string +describing the error that occurred. + +@node Printing, One-liners, Reading Files, Top +@chapter Printing Output + +@cindex printing +@cindex output +One of the most common things that actions do is to output or @dfn{print} +some or all of the input. For simple output, use the @code{print} +statement. For fancier formatting use the @code{printf} statement. +Both are described in this chapter. + +@menu +* Print:: The @code{print} statement. +* Print Examples:: Simple examples of @code{print} statements. +* Output Separators:: The output separators and how to change them. +* OFMT:: Controlling Numeric Output With @code{print}. +* Printf:: The @code{printf} statement. +* Redirection:: How to redirect output to multiple + files and pipes. +* Special Files:: File name interpretation in @code{gawk}. 
+ @code{gawk} allows access to + inherited file descriptors. +@end menu + +@node Print, Print Examples, Printing, Printing +@section The @code{print} Statement +@cindex @code{print} statement + +The @code{print} statement does output with simple, standardized +formatting. You specify only the strings or numbers to be printed, in a +list separated by commas. They are output, separated by single spaces, +followed by a newline. The statement looks like this: + +@example +print @var{item1}, @var{item2}, @dots{} +@end example + +@noindent +The entire list of items may optionally be enclosed in parentheses. The +parentheses are necessary if any of the item expressions uses a +relational operator; otherwise it could be confused with a redirection +(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}). +The relational operators are @samp{==}, +@samp{!=}, @samp{<}, @samp{>}, @samp{>=}, @samp{<=}, @samp{~} and +@samp{!~} (@pxref{Comparison Ops, ,Comparison Expressions}).@refill + +The items printed can be constant strings or numbers, fields of the +current record (such as @code{$1}), variables, or any @code{awk} +expressions. The @code{print} statement is completely general for +computing @emph{what} values to print. With two exceptions, +you cannot specify @emph{how} to print them---how many +columns, whether to use exponential notation or not, and so on. +(@xref{Output Separators}, and +@ref{OFMT, ,Controlling Numeric Output with @code{print}}.) +For that, you need the @code{printf} statement +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).@refill + +The simple statement @samp{print} with no items is equivalent to +@samp{print $0}: it prints the entire current record. To print a blank +line, use @samp{print ""}, where @code{""} is the null, or empty, +string. + +To print a fixed piece of text, use a string constant such as +@w{@code{"Hello there"}} as one item. 
If you forget to use the +double-quote characters, your text will be taken as an @code{awk} +expression, and you will probably get an error. Keep in mind that a +space is printed between any two items. + +Most often, each @code{print} statement makes one line of output. But it +isn't limited to one line. If an item value is a string that contains a +newline, the newline is output along with the rest of the string. A +single @code{print} can make any number of lines this way. + +@node Print Examples, Output Separators, Print, Printing +@section Examples of @code{print} Statements + +Here is an example of printing a string that contains embedded newlines: + +@example +awk 'BEGIN @{ print "line one\nline two\nline three" @}' +@end example + +@noindent +produces output like this: + +@example +line one +line two +line three +@end example + +Here is an example that prints the first two fields of each input record, +with a space between them: + +@example +awk '@{ print $1, $2 @}' inventory-shipped +@end example + +@noindent +Its output looks like this: + +@example +Jan 13 +Feb 15 +Mar 15 +@dots{} +@end example + +A common mistake in using the @code{print} statement is to omit the comma +between two items. This often has the effect of making the items run +together in the output, with no space. The reason for this is that +juxtaposing two string expressions in @code{awk} means to concatenate +them. For example, without the comma: + +@example +awk '@{ print $1 $2 @}' inventory-shipped +@end example + +@noindent +prints: + +@example +@group +Jan13 +Feb15 +Mar15 +@dots{} +@end group +@end example + +Neither example's output makes much sense to someone unfamiliar with the +file @file{inventory-shipped}. A heading line at the beginning would make +it clearer. Let's add some headings to our table of months (@code{$1}) and +green crates shipped (@code{$2}). 
We do this using the @code{BEGIN} pattern +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}) to force the headings to be printed only once: + +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, $2 @}' inventory-shipped +@end example + +@noindent +Did you already guess what happens? This program prints the following: + +@example +@group +Month Crates +----- ------ +Jan 13 +Feb 15 +Mar 15 +@dots{} +@end group +@end example + +@noindent +The headings and the table data don't line up! We can fix this by printing +some spaces between the two fields: + +@example +awk 'BEGIN @{ print "Month Crates" + print "----- ------" @} + @{ print $1, " ", $2 @}' inventory-shipped +@end example + +You can imagine that this way of lining up columns can get pretty +complicated when you have many columns to fix. Counting spaces for two +or three columns can be simple, but more than this and you can get +``lost'' quite easily. This is why the @code{printf} statement was +created (@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}); +one of its specialties is lining up columns of data.@refill + +@node Output Separators, OFMT, Print Examples, Printing +@section Output Separators + +@cindex output field separator, @code{OFS} +@vindex OFS +@vindex ORS +@cindex output record separator, @code{ORS} +As mentioned previously, a @code{print} statement contains a list +of items, separated by commas. In the output, the items are normally +separated by single spaces. But they do not have to be spaces; a +single space is only the default. You can specify any string of +characters to use as the @dfn{output field separator} by setting the +built-in variable @code{OFS}. The initial value of this variable +is the string @w{@code{" "}}, that is, just a single space.@refill + +The output from an entire @code{print} statement is called an +@dfn{output record}. 
Each @code{print} statement outputs one output +record and then outputs a string called the @dfn{output record separator}. +The built-in variable @code{ORS} specifies this string. The initial +value of the variable is the string @code{"\n"} containing a newline +character; thus, normally each @code{print} statement makes a separate line. + +You can change how output fields and records are separated by assigning +new values to the variables @code{OFS} and/or @code{ORS}. The usual +place to do this is in the @code{BEGIN} rule +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}), so +that it happens before any input is processed. You may also do this +with assignments on the command line, before the names of your input +files.@refill + +The following example prints the first and second fields of each input +record separated by a semicolon, with a blank line added after each +line:@refill + +@example +@group +awk 'BEGIN @{ OFS = ";"; ORS = "\n\n" @} + @{ print $1, $2 @}' BBS-list +@end group +@end example + +If the value of @code{ORS} does not contain a newline, all your output +will be run together on a single line, unless you output newlines some +other way. + +@node OFMT, Printf, Output Separators, Printing +@section Controlling Numeric Output with @code{print} +@vindex OFMT +When you use the @code{print} statement to print numeric values, +@code{awk} internally converts the number to a string of characters, +and prints that string. @code{awk} uses the @code{sprintf} function +to do this conversion. For now, it suffices to say that the @code{sprintf} +function accepts a @dfn{format specification} that tells it how to format +numbers (or strings), and that there are a number of different ways that +numbers can be formatted. 
The different format specifications are discussed +more fully in +@ref{Printf, ,Using @code{printf} Statements for Fancier Printing}.@refill + +The built-in variable @code{OFMT} contains the default format specification +that @code{print} uses with @code{sprintf} when it wants to convert a +number to a string for printing. By supplying different format specifications +as the value of @code{OFMT}, you can change how @code{print} will print +your numbers. As a brief example: + +@example +@group +awk 'BEGIN @{ OFMT = "%d" # print numbers as integers + print 17.23 @}' +@end group +@end example + +@noindent +will print @samp{17}. + +@node Printf, Redirection, OFMT, Printing +@section Using @code{printf} Statements for Fancier Printing +@cindex formatted output +@cindex output, formatted + +If you want more precise control over the output format than +@code{print} gives you, use @code{printf}. With @code{printf} you can +specify the width to use for each item, and you can specify various +stylistic choices for numbers (such as what radix to use, whether to +print an exponent, whether to print a sign, and how many digits to print +after the decimal point). You do this by specifying a string, called +the @dfn{format string}, which controls how and where to print the other +arguments. + +@menu +* Basic Printf:: Syntax of the @code{printf} statement. +* Control Letters:: Format-control letters. +* Format Modifiers:: Format-specification modifiers. +* Printf Examples:: Several examples. +@end menu + +@node Basic Printf, Control Letters, Printf, Printf +@subsection Introduction to the @code{printf} Statement + +@cindex @code{printf} statement, syntax of +The @code{printf} statement looks like this:@refill + +@example +printf @var{format}, @var{item1}, @var{item2}, @dots{} +@end example + +@noindent +The entire list of arguments may optionally be enclosed in parentheses. 
The +parentheses are necessary if any of the item expressions uses a +relational operator; otherwise it could be confused with a redirection +(@pxref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}). +The relational operators are @samp{==}, +@samp{!=}, @samp{<}, @samp{>}, @samp{>=}, @samp{<=}, @samp{~} and +@samp{!~} (@pxref{Comparison Ops, ,Comparison Expressions}).@refill + +@cindex format string +The difference between @code{printf} and @code{print} is the argument +@var{format}. This is an expression whose value is taken as a string; it +specifies how to output each of the other arguments. It is called +the @dfn{format string}. + +The format string is the same as in the @sc{ansi} C library function +@code{printf}. Most of @var{format} is text to be output verbatim. +Scattered among this text are @dfn{format specifiers}, one per item. +Each format specifier says to output the next item at that place in the +format.@refill + +The @code{printf} statement does not automatically append a newline to its +output. It outputs only what the format specifies. So if you want +a newline, you must include one in the format. The output separator +variables @code{OFS} and @code{ORS} have no effect on @code{printf} +statements.@refill + +@node Control Letters, Format Modifiers, Basic Printf, Printf +@subsection Format-Control Letters +@cindex @code{printf}, format-control characters +@cindex format specifier + +A format specifier starts with the character @samp{%} and ends with a +@dfn{format-control letter}; it tells the @code{printf} statement how +to output one item. (If you actually want to output a @samp{%}, write +@samp{%%}.) The format-control letter specifies what kind of value to +print. The rest of the format specifier is made up of optional +@dfn{modifiers} which are parameters such as the field width to use.@refill + +Here is a list of the format-control letters: + +@table @samp +@item c +This prints a number as an ASCII character. 
Thus, @samp{printf "%c", +65} outputs the letter @samp{A}. The output for a string value is +the first character of the string. + +@item d +This prints a decimal integer. + +@item i +This also prints a decimal integer. + +@item e +This prints a number in scientific (exponential) notation. +For example, + +@example +printf "%4.3e", 1950 +@end example + +@noindent +prints @samp{1.950e+03}, with a total of four significant figures of +which three follow the decimal point. The @samp{4.3} are @dfn{modifiers}, +discussed below. + +@item f +This prints a number in floating point notation. + +@item g +This prints a number in either scientific notation or floating point +notation, whichever uses fewer characters. +@ignore +From: gatech!ames!elroy!cit-vax!EQL.Caltech.Edu!rankin (Pat Rankin) + +In the description of printf formats (p.43), the information for %g +is incorrect (mainly, it's too much of an oversimplification). It's +wrong in the AWK book too, and in the gawk man page. I suggested to +David Trueman before 2.13 was released that the latter be revised, so +that it matched gawk's behavior (rather than trying to change gawk to +match the docs ;-). The documented description is nice and simple, but +it doesn't match the actual underlying behavior of %g in the various C +run-time libraries that gawk relies on. The precision value for g format +is different than for f and e formats, so it's inaccurate to say 'g' is +the shorter of 'e' or 'f'. For 'g', precision represents the number of +significant digits rather than the number of decimal places, and it has +special rules about how to format numbers with range between 10E-1 and +10E-4. All in all, it's pretty messy, and I had to add that clumsy +GFMT_WORKAROUND code because the VMS run-time library doesn't conform to +the ANSI-C specifications. +@end ignore + +@item o +This prints an unsigned octal integer. + +@item s +This prints a string. + +@item x +This prints an unsigned hexadecimal integer. 
+ +@item X +This prints an unsigned hexadecimal integer. However, for the values 10 +through 15, it uses the letters @samp{A} through @samp{F} instead of +@samp{a} through @samp{f}. + +@item % +This isn't really a format-control letter, but it does have a meaning +when used after a @samp{%}: the sequence @samp{%%} outputs one +@samp{%}. It does not consume an argument. +@end table + +@node Format Modifiers, Printf Examples, Control Letters, Printf +@subsection Modifiers for @code{printf} Formats + +@cindex @code{printf}, modifiers +@cindex modifiers (in format specifiers) +A format specification can also include @dfn{modifiers} that can control +how much of the item's value is printed and how much space it gets. The +modifiers come between the @samp{%} and the format-control letter. Here +are the possible modifiers, in the order in which they may appear: + +@table @samp +@item - +The minus sign, used before the width modifier, says to left-justify +the argument within its specified width. Normally the argument +is printed right-justified in the specified width. Thus, + +@example +printf "%-4s", "foo" +@end example + +@noindent +prints @samp{foo }. + +@item @var{width} +This is a number representing the desired width of a field. Inserting any +number between the @samp{%} sign and the format control character forces the +field to be expanded to this width. The default way to do this is to +pad with spaces on the left. For example, + +@example +printf "%4s", "foo" +@end example + +@noindent +prints @samp{ foo}. + +The value of @var{width} is a minimum width, not a maximum. If the item +value requires more than @var{width} characters, it can be as wide as +necessary. Thus, + +@example +printf "%4s", "foobar" +@end example + +@noindent +prints @samp{foobar}. + +Preceding the @var{width} with a minus sign causes the output to be +padded with spaces on the right, instead of on the left. 
+ +@item .@var{prec} +This is a number that specifies the precision to use when printing. +This specifies the number of digits you want printed to the right of the +decimal point. For a string, it specifies the maximum number of +characters from the string that should be printed. +@end table + +The C library @code{printf}'s dynamic @var{width} and @var{prec} +capability (for example, @code{"%*.*s"}) is supported. Instead of +supplying explicit @var{width} and/or @var{prec} values in the format +string, you pass them in the argument list. For example:@refill + +@example +w = 5 +p = 3 +s = "abcdefg" +printf "<%*.*s>\n", w, p, s +@end example + +@noindent +is exactly equivalent to + +@example +s = "abcdefg" +printf "<%5.3s>\n", s +@end example + +@noindent +Both programs output @samp{@w{<@bullet{}@bullet{}abc>}}. (We have +used the bullet symbol ``@bullet{}'' to represent a space, to clearly +show you that there are two spaces in the output.)@refill + +Earlier versions of @code{awk} did not support this capability. You may +simulate it by using concatenation to build up the format string, +like so:@refill + +@example +w = 5 +p = 3 +s = "abcdefg" +printf "<%" w "." p "s>\n", s +@end example + +@noindent +This is not particularly easy to read, however. + +@node Printf Examples, , Format Modifiers, Printf +@subsection Examples of Using @code{printf} + +Here is how to use @code{printf} to make an aligned table: + +@example +awk '@{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end example + +@noindent +prints the names of bulletin boards (@code{$1}) of the file +@file{BBS-list} as a string of 10 characters, left justified. It also +prints the phone numbers (@code{$2}) afterward on the line. 
This +produces an aligned two-column table of names and phone numbers:@refill + +@example +@group +aardvark 555-5553 +alpo-net 555-3412 +barfly 555-7685 +bites 555-1675 +camelot 555-0542 +core 555-2912 +fooey 555-1234 +foot 555-6699 +macfoo 555-6480 +sdace 555-3430 +sabafoo 555-2127 +@end group +@end example + +Did you notice that we did not specify that the phone numbers be printed +as numbers? They had to be printed as strings because the numbers are +separated by a dash. This dash would be interpreted as a minus sign if +we had tried to print the phone numbers as numbers. This would have led +to some pretty confusing results. + +We did not specify a width for the phone numbers because they are the +last things on their lines. We don't need to put spaces after them. + +We could make our table look even nicer by adding headings to the tops +of the columns. To do this, use the @code{BEGIN} pattern +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}) +to force the header to be printed only once, at the beginning of +the @code{awk} program:@refill + +@example +@group +awk 'BEGIN @{ print "Name Number" + print "---- ------" @} + @{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end group +@end example + +Did you notice that we mixed @code{print} and @code{printf} statements in +the above example? We could have used just @code{printf} statements to get +the same results: + +@example +@group +awk 'BEGIN @{ printf "%-10s %s\n", "Name", "Number" + printf "%-10s %s\n", "----", "------" @} + @{ printf "%-10s %s\n", $1, $2 @}' BBS-list +@end group +@end example + +@noindent +By outputting each column heading with the same format specification +used for the elements of the column, we have made sure that the headings +are aligned just like the columns. 
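
If you do not have the @file{BBS-list} file at hand, you can still try the heading technique. The following is only a sketch: it assumes two sample records, typed in here with just a name and a phone number each, are fed to @code{awk} on its standard input. The shell variable name @code{table} is likewise just an illustration.

```shell
# Two made-up two-field records stand in for BBS-list (an assumption
# of this sketch; the real file has more fields per record).
table=$(printf 'aardvark 555-5553\nalpo-net 555-3412\n' |
awk 'BEGIN { printf "%-10s %s\n", "Name", "Number"
             printf "%-10s %s\n", "----", "------" }
           { printf "%-10s %s\n", $1, $2 }')

# Show the aligned table: two heading lines, then one line per record.
printf '%s\n' "$table"
```

Because the headings and the data go through the same format string, each first column is padded to 10 characters and the second column starts at the same position on every line.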
+ +The fact that the same format specification is used three times can be +emphasized by storing it in a variable, like this: + +@example +awk 'BEGIN @{ format = "%-10s %s\n" + printf format, "Name", "Number" + printf format, "----", "------" @} + @{ printf format, $1, $2 @}' BBS-list +@end example + +See if you can use the @code{printf} statement to line up the headings and +table data for our @file{inventory-shipped} example covered earlier in the +section on the @code{print} statement +(@pxref{Print, ,The @code{print} Statement}).@refill + +@node Redirection, Special Files, Printf, Printing +@section Redirecting Output of @code{print} and @code{printf} + +@cindex output redirection +@cindex redirection of output +So far we have been dealing only with output that prints to the standard +output, usually your terminal. Both @code{print} and @code{printf} can +also send their output to other places. +This is called @dfn{redirection}.@refill + +A redirection appears after the @code{print} or @code{printf} statement. +Redirections in @code{awk} are written just like redirections in shell +commands, except that they are written inside the @code{awk} program. + +@menu +* File/Pipe Redirection:: Redirecting Output to Files and Pipes. +* Close Output:: How to close output files and pipes. +@end menu + +@node File/Pipe Redirection, Close Output, Redirection, Redirection +@subsection Redirecting Output to Files and Pipes + +Here are the three forms of output redirection. They are all shown for +the @code{print} statement, but they work identically for @code{printf} +also.@refill + +@table @code +@item print @var{items} > @var{output-file} +This type of redirection prints the items onto the output file +@var{output-file}. The file name @var{output-file} can be any +expression. 
Its value is changed to a string and then used as a +file name (@pxref{Expressions, ,Expressions as Action Statements}).@refill + +When this type of redirection is used, the @var{output-file} is erased +before the first output is written to it. Subsequent writes do not +erase @var{output-file}, but append to it. If @var{output-file} does +not exist, then it is created.@refill + +For example, here is how one @code{awk} program can write a list of +BBS names to a file @file{name-list} and a list of phone numbers to a +file @file{phone-list}. Each output file contains one name or number +per line. + +@smallexample +awk '@{ print $2 > "phone-list" + print $1 > "name-list" @}' BBS-list +@end smallexample + +@item print @var{items} >> @var{output-file} +This type of redirection prints the items onto the output file +@var{output-file}. The difference between this and the +single-@samp{>} redirection is that the old contents (if any) of +@var{output-file} are not erased. Instead, the @code{awk} output is +appended to the file. + +@cindex pipes for output +@cindex output, piping +@item print @var{items} | @var{command} +It is also possible to send output through a @dfn{pipe} instead of into a +file. This type of redirection opens a pipe to @var{command} and writes +the values of @var{items} through this pipe, to another process created +to execute @var{command}.@refill + +The redirection argument @var{command} is actually an @code{awk} +expression. Its value is converted to a string, whose contents give the +shell command to be run. + +For example, this produces two files, one unsorted list of BBS names +and one list sorted in reverse alphabetical order: + +@smallexample +awk '@{ print $1 > "names.unsorted" + print $1 | "sort -r > names.sorted" @}' BBS-list +@end smallexample + +Here the unsorted list is written with an ordinary redirection while +the sorted list is written by piping through the @code{sort} utility. 
+ +Here is an example that uses redirection to mail a message to a mailing +list @samp{bug-system}. This might be useful when trouble is encountered +in an @code{awk} script run periodically for system maintenance. + +@smallexample +report = "mail bug-system" +print "Awk script failed:", $0 | report +print "at record number", FNR, "of", FILENAME | report +close(report) +@end smallexample + +We call the @code{close} function here because it's a good idea to close +the pipe as soon as all the intended output has been sent to it. +@xref{Close Output, ,Closing Output Files and Pipes}, for more information +on this. This example also illustrates the use of a variable to represent +a @var{file} or @var{command}: it is not necessary to always +use a string constant. Using a variable is generally a good idea, +since @code{awk} requires you to spell the string value identically +every time. +@end table + +Redirecting output using @samp{>}, @samp{>>}, or @samp{|} asks the system +to open a file or pipe only if the particular @var{file} or @var{command} +you've specified has not already been written to by your program, or if +it has been closed since it was last written to.@refill + +@node Close Output, , File/Pipe Redirection, Redirection +@subsection Closing Output Files and Pipes +@cindex closing output files and pipes +@findex close + +When a file or pipe is opened, the file name or command associated with +it is remembered by @code{awk} and subsequent writes to the same file or +command are appended to the previous writes. The file or pipe stays +open until @code{awk} exits. This is usually convenient. + +Sometimes there is a reason to close an output file or pipe earlier +than that. To do this, use the @code{close} function, as follows: + +@example +close(@var{filename}) +@end example + +@noindent +or + +@example +close(@var{command}) +@end example + +The argument @var{filename} or @var{command} can be any expression. 
+Its value must exactly equal the string used to open the file or pipe +to begin with---for example, if you open a pipe with this: + +@example +print $1 | "sort -r > names.sorted" +@end example + +@noindent +then you must close it with this: + +@example +close("sort -r > names.sorted") +@end example + +Here are some reasons why you might need to close an output file: + +@itemize @bullet +@item +To write a file and read it back later on in the same @code{awk} +program. Close the file when you are finished writing it; then +you can start reading it with @code{getline} +(@pxref{Getline, ,Explicit Input with @code{getline}}).@refill + +@item +To write numerous files, successively, in the same @code{awk} +program. If you don't close the files, eventually you may exceed a +system limit on the number of open files in one process. So close +each one when you are finished writing it. + +@item +To make a command finish. When you redirect output through a pipe, +the command reading the pipe normally continues to try to read input +as long as the pipe is open. Often this means the command cannot +really do its work until the pipe is closed. For example, if you +redirect output to the @code{mail} program, the message is not +actually sent until the pipe is closed. + +@item +To run the same program a second time, with the same arguments. +This is not the same thing as giving more input to the first run! + +For example, suppose you pipe output to the @code{mail} program. If you +output several lines redirected to this pipe without closing it, they make +a single message of several lines. By contrast, if you close the pipe +after each line of output, then each line makes a separate message. +@end itemize + +@iftex +@vindex ERRNO +@cindex differences: @code{gawk} and @code{awk} +@end iftex +@code{close} returns a value of zero if the close succeeded. +Otherwise, the value will be non-zero. 
+In this case, @code{gawk} sets the variable @code{ERRNO} to a string +describing the error that occurred. + +@node Special Files, , Redirection, Printing +@section Standard I/O Streams +@cindex standard input +@cindex standard output +@cindex standard error output +@cindex file descriptors + +Running programs conventionally have three input and output streams +already available to them for reading and writing. These are known as +the @dfn{standard input}, @dfn{standard output}, and @dfn{standard error +output}. These streams are, by default, terminal input and output, but +they are often redirected with the shell, via the @samp{<}, @samp{<<}, +@samp{>}, @samp{>>}, @samp{>&} and @samp{|} operators. Standard error +is used only for writing error messages; the reason we have two separate +streams, standard output and standard error, is so that they can be +redirected separately. + +@iftex +@cindex differences: @code{gawk} and @code{awk} +@end iftex +In other implementations of @code{awk}, the only way to write an error +message to standard error in an @code{awk} program is as follows: + +@smallexample +print "Serious error detected!\n" | "cat 1>&2" +@end smallexample + +@noindent +This works by opening a pipeline to a shell command which can access the +standard error stream which it inherits from the @code{awk} process. +This is far from elegant, and is also inefficient, since it requires a +separate process. So people writing @code{awk} programs have often +neglected to do this. Instead, they have sent the error messages to the +terminal, like this: + +@smallexample +@group +NF != 4 @{ + printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/tty" +@} +@end group +@end smallexample + +@noindent +This has the same effect most of the time, but not always: although the +standard error stream is usually the terminal, it can be redirected, and +when that happens, writing to the terminal is not correct. 
In fact, if +@code{awk} is run from a background job, it may not have a terminal at all. +Then opening @file{/dev/tty} will fail. + +@code{gawk} provides special file names for accessing the three standard +streams. When you redirect input or output in @code{gawk}, if the file name +matches one of these special names, then @code{gawk} directly uses the +stream it stands for. + +@cindex @file{/dev/stdin} +@cindex @file{/dev/stdout} +@cindex @file{/dev/stderr} +@cindex @file{/dev/fd/} +@table @file +@item /dev/stdin +The standard input (file descriptor 0). + +@item /dev/stdout +The standard output (file descriptor 1). + +@item /dev/stderr +The standard error output (file descriptor 2). + +@item /dev/fd/@var{N} +The file associated with file descriptor @var{N}. Such a file must have +been opened by the program initiating the @code{awk} execution (typically +the shell). Unless you take special pains, only descriptors 0, 1 and 2 +are available. +@end table + +The file names @file{/dev/stdin}, @file{/dev/stdout}, and @file{/dev/stderr} +are aliases for @file{/dev/fd/0}, @file{/dev/fd/1}, and @file{/dev/fd/2}, +respectively, but they are more self-explanatory. + +The proper way to write an error message in a @code{gawk} program +is to use @file{/dev/stderr}, like this: + +@smallexample +NF != 4 @{ + printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr" +@} +@end smallexample + +@code{gawk} also provides special file names that give access to information +about the running @code{gawk} process. Each of these ``files'' provides +a single record of information. To read them more than once, you must +first close them with the @code{close} function +(@pxref{Close Input, ,Closing Input Files and Pipes}). 
+The filenames are: + +@cindex @file{/dev/pid} +@cindex @file{/dev/pgrpid} +@cindex @file{/dev/ppid} +@cindex @file{/dev/user} +@table @file +@item /dev/pid +Reading this file returns the process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/ppid +Reading this file returns the parent process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/pgrpid +Reading this file returns the process group ID of the current process, +in decimal, terminated with a newline. + +@item /dev/user +Reading this file returns a single record terminated with a newline. +The fields are separated with blanks. The fields represent the +following information: + +@table @code +@item $1 +The value of the @code{getuid} system call. + +@item $2 +The value of the @code{geteuid} system call. + +@item $3 +The value of the @code{getgid} system call. + +@item $4 +The value of the @code{getegid} system call. +@end table + +If there are any additional fields, they are the group IDs returned by +@code{getgroups} system call. +(Multiple groups may not be supported on all systems.)@refill +@end table + +These special file names may be used on the command line as data +files, as well as for I/O redirections within an @code{awk} program. +They may not be used as source files with the @samp{-f} option. + +Recognition of these special file names is disabled if @code{gawk} is in +compatibility mode (@pxref{Command Line, ,Invoking @code{awk}}). + +@quotation +@strong{Caution}: Unless your system actually has a @file{/dev/fd} directory +(or any of the other above listed special files), +the interpretation of these file names is done by @code{gawk} itself. +For example, using @samp{/dev/fd/4} for output will actually write on +file descriptor 4, and not on a new file descriptor that was @code{dup}'ed +from file descriptor 4. 
Most of the time this does not matter; however, it +is important to @emph{not} close any of the files related to file descriptors +0, 1, and 2. If you do close one of these files, unpredictable behavior +will result. +@end quotation + +@node One-liners, Patterns, Printing, Top +@chapter Useful ``One-liners'' + +@cindex one-liners +Useful @code{awk} programs are often short, just a line or two. Here is a +collection of useful, short programs to get you started. Some of these +programs contain constructs that haven't been covered yet. The description +of the program will give you a good idea of what is going on, but please +read the rest of the manual to become an @code{awk} expert! + +@c Per suggestions from Michal Jaegermann +@ifinfo +Since you are reading this in Info, each line of the example code is +enclosed in quotes, to represent text that you would type literally. +The examples themselves represent shell commands that use single quotes +to keep the shell from interpreting the contents of the program. +When reading the examples, focus on the text between the open and close +quotes. +@end ifinfo + +@table @code +@item awk '@{ if (NF > max) max = NF @} +@itemx @ @ @ @ @ END @{ print max @}' +This program prints the maximum number of fields on any input line. + +@item awk 'length($0) > 80' +This program prints every line longer than 80 characters. The sole +rule has a relational expression as its pattern, and has no action (so the +default action, printing the record, is used). + +@item awk 'NF > 0' +This program prints every line that has at least one field. This is an +easy way to delete blank lines from a file (or rather, to create a new +file similar to the old file but from which the blank lines have been +deleted). + +@item awk '@{ if (NF > 0) print @}' +This program also prints every line that has at least one field. Here we +allow the rule to match every line, then decide in the action whether +to print. 
+ +@item awk@ 'BEGIN@ @{@ for (i = 1; i <= 7; i++) +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ print int(101 * rand()) @}' +This program prints 7 random numbers from 0 to 100, inclusive. + +@item ls -l @var{files} | awk '@{ x += $4 @} ; END @{ print "total bytes: " x @}' +This program prints the total number of bytes used by @var{files}. + +@item expand@ @var{file}@ |@ awk@ '@{ if (x < length()) x = length() @} +@itemx @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ @ END @{ print "maximum line length is " x @}' +This program prints the maximum line length of @var{file}. The input +is piped through the @code{expand} program to change tabs into spaces, +so the widths compared are actually the right-margin columns. + +@item awk 'BEGIN @{ FS = ":" @} +@itemx @ @ @ @ @ @{ print $1 | "sort" @}' /etc/passwd +This program prints a sorted list of the login names of all users. + +@item awk '@{ nlines++ @} +@itemx @ @ @ @ @ END@ @{ print nlines @}' +This program counts lines in a file. + +@item awk 'END @{ print NR @}' +This program also counts lines in a file, but lets @code{awk} do the work. + +@item awk '@{ print NR, $0 @}' +This program adds line numbers to all its input files, +similar to @samp{cat -n}. +@end table + +@node Patterns, Actions, One-liners, Top +@chapter Patterns +@cindex pattern, definition of + +Patterns in @code{awk} control the execution of rules: a rule is +executed when its pattern matches the current input record. This +chapter tells all about how to write patterns. + +@menu +* Kinds of Patterns:: A list of all kinds of patterns. + The following subsections describe + them in detail. +* Regexp:: Regular expressions such as @samp{/foo/}. +* Comparison Patterns:: Comparison expressions such as @code{$1 > 10}. +* Boolean Patterns:: Combining comparison expressions. +* Expression Patterns:: Any expression can be used as a pattern. +* Ranges:: Pairs of patterns specify record ranges. +* BEGIN/END:: Specifying initialization and cleanup rules.
+* Empty:: The empty pattern, which matches every record. +@end menu + +@node Kinds of Patterns, Regexp, Patterns, Patterns +@section Kinds of Patterns +@cindex patterns, types of + +Here is a summary of the types of patterns supported in @code{awk}. +@c At the next rewrite, check to see that this order matches the +@c order in the text. It might not matter to a reader, but it's good +@c style. Also, it might be nice to mention all the topics of sections +@c that follow in this list; that way people can scan and know when to +@c expect a specific topic. Specifically please also make an entry +@c for Boolean operators as patterns in the right place. --mew + +@table @code +@item /@var{regular expression}/ +A regular expression as a pattern. It matches when the text of the +input record fits the regular expression. +(@xref{Regexp, ,Regular Expressions as Patterns}.)@refill + +@item @var{expression} +A single expression. It matches when its value +is nonzero (if a number) or nonnull (if a string). +(@xref{Expression Patterns, ,Expressions as Patterns}.)@refill + +@item @var{pat1}, @var{pat2} +A pair of patterns separated by a comma, specifying a range of records. +(@xref{Ranges, ,Specifying Record Ranges with Patterns}.) + +@item BEGIN +@itemx END +Special patterns to supply start-up or clean-up information to +@code{awk}. (@xref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}.) + +@item @var{null} +The empty pattern matches every input record. +(@xref{Empty, ,The Empty Pattern}.)@refill +@end table + + +@node Regexp, Comparison Patterns, Kinds of Patterns, Patterns +@section Regular Expressions as Patterns +@cindex pattern, regular expressions +@cindex regexp +@cindex regular expressions as patterns + +A @dfn{regular expression}, or @dfn{regexp}, is a way of describing a +class of strings. A regular expression enclosed in slashes (@samp{/}) +is an @code{awk} pattern that matches every input record whose text +belongs to that class.
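
As a small illustration of a slash-enclosed regexp acting as a pattern, here is a sketch that assumes three sample records (typed in here, not read from a file) arrive on standard input. The rule has no action, so each matching record is simply printed. The shell variable name @code{matched} is only for illustration.

```shell
# Three made-up records; only those whose text contains "foo" belong
# to the class of strings described by the regexp foo.
matched=$(printf 'fooey 555-1234\ncore 555-2912\nmacfoo 555-6480\n' |
awk '/foo/')

# Print the selected records.
printf '%s\n' "$matched"
```

The record for @samp{core} is not selected, because the sequence @samp{foo} does not appear anywhere in it.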
+ +The simplest regular expression is a sequence of letters, numbers, or +both. Such a regexp matches any string that contains that sequence. +Thus, the regexp @samp{foo} matches any string containing @samp{foo}. +Therefore, the pattern @code{/foo/} matches any input record containing +@samp{foo}. Other kinds of regexps let you specify more complicated +classes of strings. + +@menu +* Regexp Usage:: How to Use Regular Expressions +* Regexp Operators:: Regular Expression Operators +* Case-sensitivity:: How to do case-insensitive matching. +@end menu + +@node Regexp Usage, Regexp Operators, Regexp, Regexp +@subsection How to Use Regular Expressions + +A regular expression can be used as a pattern by enclosing it in +slashes. Then the regular expression is matched against the +entire text of each record. (Normally, it only needs +to match some part of the text in order to succeed.) For example, this +prints the second field of each record that contains @samp{foo} anywhere: + +@example +awk '/foo/ @{ print $2 @}' BBS-list +@end example + +@cindex regular expression matching operators +@cindex string-matching operators +@cindex operators, string-matching +@cindex operators, regexp matching +@cindex regexp search operators +Regular expressions can also be used in comparison expressions. Then +you can specify the string to match against; it need not be the entire +current input record. These comparison expressions can be used as +patterns or in @code{if}, @code{while}, @code{for}, and @code{do} statements. + +@table @code +@item @var{exp} ~ /@var{regexp}/ +This is true if the expression @var{exp} (taken as a character string) +is matched by @var{regexp}. 
The following example matches, or selects, +all input records with the upper-case letter @samp{J} somewhere in the +first field:@refill + +@example +awk '$1 ~ /J/' inventory-shipped +@end example + +So does this: + +@example +awk '@{ if ($1 ~ /J/) print @}' inventory-shipped +@end example + +@item @var{exp} !~ /@var{regexp}/ +This is true if the expression @var{exp} (taken as a character string) +is @emph{not} matched by @var{regexp}. The following example matches, +or selects, all input records whose first field @emph{does not} contain +the upper-case letter @samp{J}:@refill + +@example +awk '$1 !~ /J/' inventory-shipped +@end example +@end table + +@cindex computed regular expressions +@cindex regular expressions, computed +@cindex dynamic regular expressions +The right hand side of a @samp{~} or @samp{!~} operator need not be a +constant regexp (i.e., a string of characters between slashes). It may +be any expression. The expression is evaluated, and converted if +necessary to a string; the contents of the string are used as the +regexp. A regexp that is computed in this way is called a @dfn{dynamic +regexp}. For example: + +@example +identifier_regexp = "[A-Za-z_][A-Za-z_0-9]*" +$0 ~ identifier_regexp +@end example + +@noindent +sets @code{identifier_regexp} to a regexp that describes @code{awk} +variable names, and tests whether the input record matches this regexp. + +@node Regexp Operators, Case-sensitivity, Regexp Usage, Regexp +@subsection Regular Expression Operators +@cindex metacharacters +@cindex regular expression metacharacters + +You can combine regular expressions with the following characters, +called @dfn{regular expression operators}, or @dfn{metacharacters}, to +increase the power and versatility of regular expressions. + +Here is a table of metacharacters. All characters not listed in the +table stand for themselves. + +@table @code +@item ^ +This matches the beginning of the string or the beginning of a line +within the string.
For example: + +@example +^@@chapter +@end example + +@noindent +matches the @samp{@@chapter} at the beginning of a string, and can be used +to identify chapter beginnings in Texinfo source files. + +@item $ +This is similar to @samp{^}, but it matches only at the end of a string +or the end of a line within the string. For example: + +@example +p$ +@end example + +@noindent +matches a record that ends with a @samp{p}. + +@item . +This matches any single character except a newline. For example: + +@example +.P +@end example + +@noindent +matches any single character followed by a @samp{P} in a string. Using +concatenation we can make regular expressions like @samp{U.A}, which +matches any three-character sequence that begins with @samp{U} and ends +with @samp{A}. + +@item [@dots{}] +This is called a @dfn{character set}. It matches any one of the +characters that are enclosed in the square brackets. For example: + +@example +[MVX] +@end example + +@noindent +matches any one of the characters @samp{M}, @samp{V}, or @samp{X} in a +string.@refill + +Ranges of characters are indicated by using a hyphen between the beginning +and ending characters, and enclosing the whole thing in brackets. For +example:@refill + +@example +[0-9] +@end example + +@noindent +matches any digit. + +To include the character @samp{\}, @samp{]}, @samp{-} or @samp{^} in a +character set, put a @samp{\} in front of it. For example: + +@example +[d\]] +@end example + +@noindent +matches either @samp{d}, or @samp{]}.@refill + +This treatment of @samp{\} is compatible with other @code{awk} +implementations, and is also mandated by the @sc{posix} Command Language +and Utilities standard. The regular expressions in @code{awk} are a superset +of the @sc{posix} specification for Extended Regular Expressions (EREs). +@sc{posix} EREs are based on the regular expressions accepted by the +traditional @code{egrep} utility. 
+ +In @code{egrep} syntax, backslash is not syntactically special within +square brackets. This means that special tricks have to be used to +represent the characters @samp{]}, @samp{-} and @samp{^} as members of a +character set. + +In @code{egrep} syntax, to match @samp{-}, write it as @samp{---}, +which is a range containing only @w{@samp{-}.} You may also give @samp{-} +as the first or last character in the set. To match @samp{^}, put it +anywhere except as the first character of a set. To match a @samp{]}, +make it the first character in the set. For example:@refill + +@example +[]d^] +@end example + +@noindent +matches either @samp{]}, @samp{d} or @samp{^}.@refill + +@item [^ @dots{}] +This is a @dfn{complemented character set}. The first character after +the @samp{[} @emph{must} be a @samp{^}. It matches any characters +@emph{except} those in the square brackets (or newline). For example: + +@example +[^0-9] +@end example + +@noindent +matches any character that is not a digit. + +@item | +This is the @dfn{alternation operator} and it is used to specify +alternatives. For example: + +@example +^P|[0-9] +@end example + +@noindent +matches any string that matches either @samp{^P} or @samp{[0-9]}. This +means it matches any string that contains a digit or starts with @samp{P}. + +The alternation applies to the largest possible regexps on either side. +@item (@dots{}) +Parentheses are used for grouping in regular expressions as in +arithmetic. They can be used to concatenate regular expressions +containing the alternation operator, @samp{|}. + +@item * +This symbol means that the preceding regular expression is to be +repeated as many times as possible to find a match. For example: + +@example +ph* +@end example + +@noindent +applies the @samp{*} symbol to the preceding @samp{h} and looks for matches +to one @samp{p} followed by any number of @samp{h}s. This will also match +just @samp{p} if no @samp{h}s are present. 
+ +The @samp{*} repeats the @emph{smallest} possible preceding expression. +(Use parentheses if you wish to repeat a larger expression.) It finds +as many repetitions as possible. For example: + +@example +awk '/\(c[ad][ad]*r x\)/ @{ print @}' sample +@end example + +@noindent +prints every record in the input containing a string of the form +@samp{(car x)}, @samp{(cdr x)}, @samp{(cadr x)}, and so on.@refill + +@item + +This symbol is similar to @samp{*}, but the preceding expression must be +matched at least once. This means that: + +@example +wh+y +@end example + +@noindent +would match @samp{why} and @samp{whhy} but not @samp{wy}, whereas +@samp{wh*y} would match all three of these strings. This is a simpler +way of writing the last @samp{*} example: + +@example +awk '/\(c[ad]+r x\)/ @{ print @}' sample +@end example + +@item ? +This symbol is similar to @samp{*}, but the preceding expression can be +matched once or not at all. For example: + +@example +fe?d +@end example + +@noindent +will match @samp{fed} and @samp{fd}, but nothing else.@refill + +@item \ +This is used to suppress the special meaning of a character when +matching. For example: + +@example +\$ +@end example + +@noindent +matches the character @samp{$}. + +The escape sequences used for string constants +(@pxref{Constants, ,Constant Expressions}) are +valid in regular expressions as well; they are also introduced by a +@samp{\}.@refill +@end table + +In regular expressions, the @samp{*}, @samp{+}, and @samp{?} operators have +the highest precedence, followed by concatenation, and finally by @samp{|}. +As in arithmetic, parentheses can change how operators are grouped.@refill + +@node Case-sensitivity, , Regexp Operators, Regexp +@subsection Case-sensitivity in Matching + +Case is normally significant in regular expressions, both when matching +ordinary characters (i.e., not metacharacters), and inside character +sets. 
Thus a @samp{w} in a regular expression matches only a lower case +@samp{w} and not an upper case @samp{W}. + +The simplest way to do a case-independent match is to use a character +set: @samp{[Ww]}. However, this can be cumbersome if you need to use it +often; and it can make the regular expressions harder for humans to +read. There are two other alternatives that you might prefer. + +One way to do a case-insensitive match at a particular point in the +program is to convert the data to a single case, using the +@code{tolower} or @code{toupper} built-in string functions (which we +haven't discussed yet; +@pxref{String Functions, ,Built-in Functions for String Manipulation}). +For example:@refill + +@example +tolower($1) ~ /foo/ @{ @dots{} @} +@end example + +@noindent +converts the first field to lower case before matching against it. + +Another method is to set the variable @code{IGNORECASE} to a nonzero +value (@pxref{Built-in Variables}). When @code{IGNORECASE} is not zero, +@emph{all} regexp operations ignore case. Changing the value of +@code{IGNORECASE} dynamically controls the case sensitivity of your +program as it runs. Case is significant by default because +@code{IGNORECASE} (like most variables) is initialized to zero. + +@example +x = "aB" +if (x ~ /ab/) @dots{} # this test will fail + +IGNORECASE = 1 +if (x ~ /ab/) @dots{} # now it will succeed +@end example + +In general, you cannot use @code{IGNORECASE} to make certain rules +case-insensitive and other rules case-sensitive, because there is no way +to set @code{IGNORECASE} just for the pattern of a particular rule. To +do this, you must use character sets or @code{tolower}. However, one +thing you can do only with @code{IGNORECASE} is turn case-sensitivity on +or off dynamically for all the rules at once.@refill + +@code{IGNORECASE} can be set on the command line, or in a @code{BEGIN} +rule. 
Setting @code{IGNORECASE} from the command line is a way to make +a program case-insensitive without having to edit it. + +The value of @code{IGNORECASE} has no effect if @code{gawk} is in +compatibility mode (@pxref{Command Line, ,Invoking @code{awk}}). +Case is always significant in compatibility mode.@refill + +@node Comparison Patterns, Boolean Patterns, Regexp, Patterns +@section Comparison Expressions as Patterns +@cindex comparison expressions as patterns +@cindex pattern, comparison expressions +@cindex relational operators +@cindex operators, relational + +@dfn{Comparison patterns} test relationships such as equality between +two strings or numbers. They are a special case of expression patterns +(@pxref{Expression Patterns, ,Expressions as Patterns}). They are written +with @dfn{relational operators}, which are a superset of those in C. +Here is a table of them:@refill + +@table @code +@item @var{x} < @var{y} +True if @var{x} is less than @var{y}. + +@item @var{x} <= @var{y} +True if @var{x} is less than or equal to @var{y}. + +@item @var{x} > @var{y} +True if @var{x} is greater than @var{y}. + +@item @var{x} >= @var{y} +True if @var{x} is greater than or equal to @var{y}. + +@item @var{x} == @var{y} +True if @var{x} is equal to @var{y}. + +@item @var{x} != @var{y} +True if @var{x} is not equal to @var{y}. + +@item @var{x} ~ @var{y} +True if @var{x} matches the regular expression described by @var{y}. + +@item @var{x} !~ @var{y} +True if @var{x} does not match the regular expression described by @var{y}. +@end table + +The operands of a relational operator are compared as numbers if they +are both numbers. Otherwise they are converted to, and compared as, +strings (@pxref{Conversion, ,Conversion of Strings and Numbers}, +for the detailed rules). Strings are compared by comparing the first +character of each, then the second character of each, +and so on, until there is a difference. 
If the two strings are equal until +the shorter one runs out, the shorter one is considered to be less than the +longer one. Thus, @code{"10"} is less than @code{"9"}, and @code{"abc"} +is less than @code{"abcd"}.@refill + +The left operand of the @samp{~} and @samp{!~} operators is a string. +The right operand is either a constant regular expression enclosed in +slashes (@code{/@var{regexp}/}), or any expression, whose string value +is used as a dynamic regular expression +(@pxref{Regexp Usage, ,How to Use Regular Expressions}).@refill + +The following example prints the second field of each input record +whose first field is precisely @samp{foo}. + +@example +awk '$1 == "foo" @{ print $2 @}' BBS-list +@end example + +@noindent +Contrast this with the following regular expression match, which would +accept any record with a first field that contains @samp{foo}: + +@example +awk '$1 ~ "foo" @{ print $2 @}' BBS-list +@end example + +@noindent +or, equivalently, this one: + +@example +awk '$1 ~ /foo/ @{ print $2 @}' BBS-list +@end example + +@node Boolean Patterns, Expression Patterns, Comparison Patterns, Patterns +@section Boolean Operators and Patterns +@cindex patterns, boolean +@cindex boolean patterns + +A @dfn{boolean pattern} is an expression which combines other patterns +using the @dfn{boolean operators} ``or'' (@samp{||}), ``and'' +(@samp{&&}), and ``not'' (@samp{!}). Whether the boolean pattern +matches an input record depends on whether its subpatterns match. 
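The subpatterns need not all be regexps. As a small sketch, assuming a POSIX shell and @code{awk}, the hypothetical function @code{boolean_demo} below combines a numeric comparison with a negated regexp; the sample records are made up for illustration.

```shell
# Hypothetical helper: select records whose second field is at least 1200
# and whose text does not start with "gamma".
boolean_demo() {
  printf '%s\n' 'alpha 2400' 'beta 1200' 'gamma 2400' 'delta 300' |
    awk '$2 >= 1200 && !/^gamma/'
}

boolean_demo
```

Here @samp{gamma 2400} passes the comparison but is rejected by the ``not'' subpattern, and @samp{delta 300} fails the comparison, so neither is printed.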
+ +For example, the following command prints all records in the input file +@file{BBS-list} that contain both @samp{2400} and @samp{foo}.@refill + +@example +awk '/2400/ && /foo/' BBS-list +@end example + +The following command prints all records in the input file +@file{BBS-list} that contain @emph{either} @samp{2400} or @samp{foo}, or +both.@refill + +@example +awk '/2400/ || /foo/' BBS-list +@end example + +The following command prints all records in the input file +@file{BBS-list} that do @emph{not} contain the string @samp{foo}. + +@example +awk '! /foo/' BBS-list +@end example + +Note that boolean patterns are a special case of expression patterns +(@pxref{Expression Patterns, ,Expressions as Patterns}); they are +expressions that use the boolean operators. +@xref{Boolean Ops, ,Boolean Expressions}, for complete information +on the boolean operators.@refill + +The subpatterns of a boolean pattern can be constant regular +expressions, comparisons, or any other @code{awk} expressions. Range +patterns are not expressions, so they cannot appear inside boolean +patterns. Likewise, the special patterns @code{BEGIN} and @code{END}, +which never match any input record, are not expressions and cannot +appear inside boolean patterns. + +@node Expression Patterns, Ranges, Boolean Patterns, Patterns +@section Expressions as Patterns + +Any @code{awk} expression is also valid as an @code{awk} pattern. +Then the pattern ``matches'' if the expression's value is nonzero (if a +number) or nonnull (if a string). + +The expression is reevaluated each time the rule is tested against a new +input record. If the expression uses fields such as @code{$1}, the +value depends directly on the new input record's text; otherwise, it +depends only on what has happened so far in the execution of the +@code{awk} program, but that may still be useful. + +Comparison patterns are actually a special case of this. 
For +example, the expression @code{$5 == "foo"} has the value 1 when the +value of @code{$5} equals @code{"foo"}, and 0 otherwise; therefore, this +expression as a pattern matches when the two values are equal. + +Boolean patterns are also special cases of expression patterns. + +A constant regexp as a pattern is also a special case of an expression +pattern. @code{/foo/} as an expression has the value 1 if @samp{foo} +appears in the current input record; thus, as a pattern, @code{/foo/} +matches any record containing @samp{foo}. + +Other implementations of @code{awk} that are not yet @sc{posix} compliant +are less general than @code{gawk}: they allow comparison expressions, and +boolean combinations thereof (optionally with parentheses), but not +necessarily other kinds of expressions. + +@node Ranges, BEGIN/END, Expression Patterns, Patterns +@section Specifying Record Ranges with Patterns + +@cindex range pattern +@cindex patterns, range +A @dfn{range pattern} is made of two patterns separated by a comma, of +the form @code{@var{begpat}, @var{endpat}}. It matches ranges of +consecutive input records. The first pattern @var{begpat} controls +where the range begins, and the second one @var{endpat} controls where +it ends. For example,@refill + +@example +awk '$1 == "on", $1 == "off"' +@end example + +@noindent +prints every record between @samp{on}/@samp{off} pairs, inclusive. + +A range pattern starts out by matching @var{begpat} +against every input record; when a record matches @var{begpat}, the +range pattern becomes @dfn{turned on}. The range pattern matches this +record. As long as it stays turned on, it automatically matches every +input record read. It also matches @var{endpat} against +every input record; when that succeeds, the range pattern is turned +off again for the following record. Now it goes back to checking +@var{begpat} against each record. 
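The turning on and off described above can be sketched with made-up input, assuming a POSIX shell and @code{awk}. The function name @code{range_demo} and the sample records are hypothetical. The input contains two @samp{on}/@samp{off} pairs, so the range pattern is turned on twice.

```shell
# Hypothetical helper: print every record from an "on" record through the
# next "off" record, inclusive; records outside a range are skipped.
range_demo() {
  printf '%s\n' 'skip 1' 'on' 'keep 1' 'keep 2' 'off' \
                'skip 2' 'on' 'keep 3' 'off' |
    awk '$1 == "on", $1 == "off"'
}

range_demo
```

The records @samp{skip 1} and @samp{skip 2} fall outside both ranges and are not printed.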
+ +The record that turns on the range pattern and the one that turns it +off both match the range pattern. If you don't want to operate on +these records, you can write @code{if} statements in the rule's action +to distinguish them. + +It is possible for a range pattern to be turned both on and off by the same +record, if both conditions are satisfied by that record. Then the action is +executed for just that record. + +@node BEGIN/END, Empty, Ranges, Patterns +@section @code{BEGIN} and @code{END} Special Patterns + +@cindex @code{BEGIN} special pattern +@cindex patterns, @code{BEGIN} +@cindex @code{END} special pattern +@cindex patterns, @code{END} +@code{BEGIN} and @code{END} are special patterns. They are not used to +match input records. Rather, they are used for supplying start-up or +clean-up information to your @code{awk} script. A @code{BEGIN} rule is +executed, once, before the first input record has been read. An @code{END} +rule is executed, once, after all the input has been read. For +example:@refill + +@example +awk 'BEGIN @{ print "Analysis of \"foo\"" @} + /foo/ @{ ++foobar @} + END @{ print "\"foo\" appears " foobar " times." @}' BBS-list +@end example + +This program finds the number of records in the input file @file{BBS-list} +that contain the string @samp{foo}. The @code{BEGIN} rule prints a title +for the report. There is no need to use the @code{BEGIN} rule to +initialize the counter @code{foobar} to zero, as @code{awk} does this +for us automatically (@pxref{Variables}). + +The second rule increments the variable @code{foobar} every time a +record containing the pattern @samp{foo} is read. The @code{END} rule +prints the value of @code{foobar} at the end of the run.@refill + +The special patterns @code{BEGIN} and @code{END} cannot be used in ranges +or with boolean operators (indeed, they cannot be used with any operators). + +An @code{awk} program may have multiple @code{BEGIN} and/or @code{END} +rules.
They are executed in the order they appear, all the @code{BEGIN} +rules at start-up and all the @code{END} rules at termination. + +Multiple @code{BEGIN} and @code{END} sections are useful for writing +library functions, since each library can have its own @code{BEGIN} or +@code{END} rule to do its own initialization and/or cleanup. Note that +the order in which library functions are named on the command line +controls the order in which their @code{BEGIN} and @code{END} rules are +executed. Therefore you have to be careful to write such rules in +library files so that the order in which they are executed doesn't matter. +@xref{Command Line, ,Invoking @code{awk}}, for more information on +using library functions. + +If an @code{awk} program only has a @code{BEGIN} rule, and no other +rules, then the program exits after the @code{BEGIN} rule has been run. +(Older versions of @code{awk} used to keep reading and ignoring input +until end of file was seen.) However, if an @code{END} rule exists as +well, then the input will be read, even if there are no other rules in +the program. This is necessary in case the @code{END} rule checks the +@code{NR} variable. + +@code{BEGIN} and @code{END} rules must have actions; there is no default +action for these rules since there is no current record when they run. + +@node Empty, , BEGIN/END, Patterns +@comment node-name, next, previous, up +@section The Empty Pattern + +@cindex empty pattern +@cindex pattern, empty +An empty pattern is considered to match @emph{every} input record. For +example, the program:@refill + +@example +awk '@{ print $1 @}' BBS-list +@end example + +@noindent +prints the first field of every record. + +@node Actions, Expressions, Patterns, Top +@chapter Overview of Actions +@cindex action, definition of +@cindex curly braces +@cindex action, curly braces +@cindex action, separating statements + +An @code{awk} program or script consists of a series of +rules and function definitions, interspersed. 
(Functions are +described later. @xref{User-defined, ,User-defined Functions}.) + +A rule contains a pattern and an action, either of which may be +omitted. The purpose of the @dfn{action} is to tell @code{awk} what to do +once a match for the pattern is found. Thus, the entire program +looks somewhat like this: + +@example +@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]} +@r{[}@var{pattern}@r{]} @r{[}@{ @var{action} @}@r{]} +@dots{} +function @var{name} (@var{args}) @{ @dots{} @} +@dots{} +@end example + +An action consists of one or more @code{awk} @dfn{statements}, enclosed +in curly braces (@samp{@{} and @samp{@}}). Each statement specifies one +thing to be done. The statements are separated by newlines or +semicolons. + +The curly braces around an action must be used even if the action +contains only one statement, or even if it contains no statements at +all. However, if you omit the action entirely, omit the curly braces as +well. (An omitted action is equivalent to @samp{@{ print $0 @}}.) + +Here are the kinds of statements supported in @code{awk}: + +@itemize @bullet +@item +Expressions, which can call functions or assign values to variables +(@pxref{Expressions, ,Expressions as Action Statements}). Executing +this kind of statement simply computes the value of the expression and +then ignores it. This is useful when the expression has side effects +(@pxref{Assignment Ops, ,Assignment Expressions}).@refill + +@item +Control statements, which specify the control flow of @code{awk} +programs. The @code{awk} language gives you C-like constructs +(@code{if}, @code{for}, @code{while}, and so on) as well as a few +special ones (@pxref{Statements, ,Control Statements in Actions}).@refill + +@item +Compound statements, which consist of one or more statements enclosed in +curly braces. A compound statement is used in order to put several +statements together in the body of an @code{if}, @code{while}, @code{do} +or @code{for} statement. 
+ +@item +Input control, using the @code{getline} command +(@pxref{Getline, ,Explicit Input with @code{getline}}), and the @code{next} +statement (@pxref{Next Statement, ,The @code{next} Statement}). + +@item +Output statements, @code{print} and @code{printf}. +@xref{Printing, ,Printing Output}.@refill + +@item +Deletion statements, for deleting array elements. +@xref{Delete, ,The @code{delete} Statement}.@refill +@end itemize + +@iftex +The next two chapters cover in detail expressions and control +statements, respectively. We go on to treat arrays and built-in +functions, both of which are used in expressions. Then we proceed +to discuss how to define your own functions. +@end iftex + +@node Expressions, Statements, Actions, Top +@chapter Expressions as Action Statements +@cindex expression + +Expressions are the basic building block of @code{awk} actions. An +expression evaluates to a value, which you can print, test, store in a +variable or pass to a function. But beyond that, an expression can assign a new value to a variable +or a field, with an assignment operator. + +An expression can serve as a statement on its own. Most other kinds of +statements contain one or more expressions which specify data to be +operated on. As in other languages, expressions in @code{awk} include +variables, array references, constants, and function calls, as well as +combinations of these with various operators. + +@menu +* Constants:: String, numeric, and regexp constants. +* Variables:: Variables give names to values for later use. +* Arithmetic Ops:: Arithmetic operations (@samp{+}, @samp{-}, etc.) +* Concatenation:: Concatenating strings. +* Comparison Ops:: Comparison of numbers and strings + with @samp{<}, etc. +* Boolean Ops:: Combining comparison expressions + using boolean operators + @samp{||} (``or''), @samp{&&} (``and'') and @samp{!} (``not''). + +* Assignment Ops:: Changing the value of a variable or a field. 
+* Increment Ops:: Incrementing the numeric value of a variable. + +* Conversion:: The conversion of strings to numbers + and vice versa. +* Values:: The whole truth about numbers and strings. +* Conditional Exp:: Conditional expressions select + between two subexpressions under control + of a third subexpression. +* Function Calls:: A function call is an expression. +* Precedence:: How various operators nest. +@end menu + +@node Constants, Variables, Expressions, Expressions +@section Constant Expressions +@cindex constants, types of +@cindex string constants + +The simplest type of expression is the @dfn{constant}, which always has +the same value. There are three types of constants: numeric constants, +string constants, and regular expression constants. + +@cindex numeric constant +@cindex numeric value +A @dfn{numeric constant} stands for a number. This number can be an +integer, a decimal fraction, or a number in scientific (exponential) +notation. Note that all numeric values are represented within +@code{awk} in double-precision floating point. Here are some examples +of numeric constants, which all have the same value: + +@example +105 +1.05e+2 +1050e-1 +@end example + +A string constant consists of a sequence of characters enclosed in +double-quote marks. For example: + +@example +"parrot" +@end example + +@noindent +@iftex +@cindex differences between @code{gawk} and @code{awk} +@end iftex +represents the string whose contents are @samp{parrot}. Strings in +@code{gawk} can be of any length and they can contain all the possible +8-bit ASCII characters including ASCII NUL. Other @code{awk} +implementations may have difficulty with some character codes.@refill + +@cindex escape sequence notation +Some characters cannot be included literally in a string constant. You +represent them instead with @dfn{escape sequences}, which are character +sequences beginning with a backslash (@samp{\}). 
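As a brief sketch of escape sequences in a string constant, assuming a POSIX shell and @code{awk}, the hypothetical function @code{escape_demo} below prints a tab, a double-quote, and a backslash, each between square brackets. The program text is in single quotes so that the shell passes the backslashes through to @code{awk} unchanged.

```shell
# Hypothetical helper: \t, \" and \\ stand for a tab, a double-quote and
# a single backslash inside the awk string constant.
escape_demo() {
  awk 'BEGIN { print "tab:[\t] quote:[\"] backslash:[\\]" }'
}

escape_demo
```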
+ +One use of an escape sequence is to include a double-quote character in +a string constant. Since a plain double-quote would end the string, you +must use @samp{\"} to represent a single double-quote character as a +part of the string. +The +backslash character itself is another character that cannot be +included normally; you write @samp{\\} to put one backslash in the +string. Thus, the string whose contents are the two characters +@samp{"\} must be written @code{"\"\\"}. + +Another use of backslash is to represent unprintable characters +such as newline. While there is nothing to stop you from writing most +of these characters directly in a string constant, they may look ugly. + +Here is a table of all the escape sequences used in @code{awk}: + +@table @code +@item \\ +Represents a literal backslash, @samp{\}. + +@item \a +Represents the ``alert'' character, control-g, ASCII code 7. + +@item \b +Represents a backspace, control-h, ASCII code 8. + +@item \f +Represents a formfeed, control-l, ASCII code 12. + +@item \n +Represents a newline, control-j, ASCII code 10. + +@item \r +Represents a carriage return, control-m, ASCII code 13. + +@item \t +Represents a horizontal tab, control-i, ASCII code 9. + +@item \v +Represents a vertical tab, control-k, ASCII code 11. + +@item \@var{nnn} +Represents the octal value @var{nnn}, where @var{nnn} are one to three +digits between 0 and 7. For example, the code for the ASCII ESC +(escape) character is @samp{\033}.@refill + +@item \x@var{hh}@dots{} +Represents the hexadecimal value @var{hh}, where @var{hh} are hexadecimal +digits (@samp{0} through @samp{9} and either @samp{A} through @samp{F} or +@samp{a} through @samp{f}). Like the same construct in @sc{ansi} C, the escape +sequence continues until the first non-hexadecimal digit is seen. However, +using more than two hexadecimal digits produces undefined results. 
(The +@samp{\x} escape sequence is not allowed in @sc{posix} @code{awk}.)@refill +@end table + +A @dfn{constant regexp} is a regular expression description enclosed in +slashes, such as @code{/^beginning and end$/}. Most regexps used in +@code{awk} programs are constant, but the @samp{~} and @samp{!~} +operators can also match computed or ``dynamic'' regexps +(@pxref{Regexp Usage, ,How to Use Regular Expressions}).@refill + +Constant regexps may be used like simple expressions. When a +constant regexp is not on the right hand side of the @samp{~} or +@samp{!~} operators, it has the same meaning as if it appeared +in a pattern, i.e. @samp{($0 ~ /foo/)} +(@pxref{Expression Patterns, ,Expressions as Patterns}). +This means that the two code segments,@refill + +@example +if ($0 ~ /barfly/ || $0 ~ /camelot/) + print "found" +@end example + +@noindent +and + +@example +if (/barfly/ || /camelot/) + print "found" +@end example + +@noindent +are exactly equivalent. One rather bizarre consequence of this rule is +that the following boolean expression is legal, but does not do what the user +intended:@refill + +@example +if (/foo/ ~ $1) print "found foo" +@end example + +This code is ``obviously'' testing @code{$1} for a match against the regexp +@code{/foo/}. But in fact, the expression @code{(/foo/ ~ $1)} actually means +@code{(($0 ~ /foo/) ~ $1)}. In other words, first match the input record +against the regexp @code{/foo/}. The result will be either a 0 or a 1, +depending upon the success or failure of the match. 
Then match that result +against the first field in the record.@refill + +Since it is unlikely that you would ever really wish to make this kind of +test, @code{gawk} will issue a warning when it sees this construct in +a program.@refill + +Another consequence of this rule is that the assignment statement + +@example +matches = /foo/ +@end example + +@noindent +will assign either 0 or 1 to the variable @code{matches}, depending +upon the contents of the current input record. + +Constant regular expressions are also used as the first argument for +the @code{sub} and @code{gsub} functions +(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@refill + +This feature of the language was never well documented until the +@sc{posix} specification. + +You may be wondering, when is + +@example +$1 ~ /foo/ @{ @dots{} @} +@end example + +@noindent +preferable to + +@example +$1 ~ "foo" @{ @dots{} @} +@end example + +Since the right-hand sides of both @samp{~} operators are constants, +it is more efficient to use the @samp{/foo/} form: @code{awk} can note +that you have supplied a regexp and store it internally in a form that +makes pattern matching more efficient. In the second form, @code{awk} +must first convert the string into this internal form, and then perform +the pattern matching. The first form is also better style; it shows +clearly that you intend a regexp match. + +@node Variables, Arithmetic Ops, Constants, Expressions +@section Variables +@cindex variables, user-defined +@cindex user-defined variables +@c there should be more than one subsection, ideally. Not a big deal. +@c But usually there are supposed to be at least two. One way to get +@c around this is to write the info in the subsection as the info in the +@c section itself and not have any subsections.. --mew + +Variables let you give names to values and refer to them later. You have +already seen variables in many of the examples. 
The name of a variable +must be a sequence of letters, digits and underscores, but it may not begin +with a digit. Case is significant in variable names; @code{a} and @code{A} +are distinct variables. + +A variable name is a valid expression by itself; it represents the +variable's current value. Variables are given new values with +@dfn{assignment operators} and @dfn{increment operators}. +@xref{Assignment Ops, ,Assignment Expressions}. + +A few variables have special built-in meanings, such as @code{FS}, the +field separator, and @code{NF}, the number of fields in the current +input record. @xref{Built-in Variables}, for a list of them. These +built-in variables can be used and assigned just like all other +variables, but their values are also used or changed automatically by +@code{awk}. Each built-in variable's name is made entirely of upper case +letters. + +Variables in @code{awk} can be assigned either numeric or string +values. By default, variables are initialized to the null string, which +is effectively zero if converted to a number. There is no need to +``initialize'' each variable explicitly in @code{awk}, the way you would in C or most other traditional languages. + +@menu +* Assignment Options:: Setting variables on the command line + and a summary of command line syntax. + This is an advanced method of input. +@end menu + +@node Assignment Options, , Variables, Variables +@subsection Assigning Variables on the Command Line + +You can set any @code{awk} variable by including a @dfn{variable assignment} +among the arguments on the command line when you invoke @code{awk} +(@pxref{Command Line, ,Invoking @code{awk}}). Such an assignment has +this form:@refill + +@example +@var{variable}=@var{text} +@end example + +@noindent +With it, you can set a variable either at the beginning of the +@code{awk} run or in between input files. 
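As a minimal sketch of how the position of an assignment among the arguments decides when it takes effect, the following shell fragment runs @code{awk} over two throwaway files. The file names, their contents, and the shell variables @code{tmpdir} and @code{result} are invented for this illustration and are not part of the sample data used elsewhere in this manual.

```shell
# Throwaway input files; names and contents are made up for this sketch.
tmpdir=$(mktemp -d)
printf 'a b c\n' > "$tmpdir/first"
printf 'x y z\n' > "$tmpdir/second"

# n is still unset while the BEGIN rule runs, becomes 3 before "first"
# is read, and becomes 1 before "second" is read.
result=$(awk 'BEGIN { print "BEGIN: [" n "]" } { print $n }' \
    n=3 "$tmpdir/first" n=1 "$tmpdir/second")
printf '%s\n' "$result"

rm -r "$tmpdir"
```

The empty brackets on the first output line show that an assignment placed among the file arguments has not yet happened when the @code{BEGIN} rule runs; the next two lines are the third field of the first file and the first field of the second.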
+ +If you precede the assignment with the @samp{-v} option, like this: + +@example +-v @var{variable}=@var{text} +@end example + +@noindent +then the variable is set at the very beginning, before even the +@code{BEGIN} rules are run. The @samp{-v} option and its assignment +must precede all the file name arguments, as well as the program text. + +Otherwise, the variable assignment is performed at a time determined by +its position among the input file arguments: after the processing of the +preceding input file argument. For example: + +@example +awk '@{ print $n @}' n=4 inventory-shipped n=2 BBS-list +@end example + +@noindent +prints the value of field number @code{n} for all input records. Before +the first file is read, the command line sets the variable @code{n} +equal to 4. This causes the fourth field to be printed in lines from +the file @file{inventory-shipped}. After the first file has finished, +but before the second file is started, @code{n} is set to 2, so that the +second field is printed in lines from @file{BBS-list}. + +Command line arguments are made available for explicit examination by +the @code{awk} program in an array named @code{ARGV} +(@pxref{Built-in Variables}).@refill + +@code{awk} processes the values of command line assignments for escape +sequences (@pxref{Constants, ,Constant Expressions}). + +@node Arithmetic Ops, Concatenation, Variables, Expressions +@section Arithmetic Operators +@cindex arithmetic operators +@cindex operators, arithmetic +@cindex addition +@cindex subtraction +@cindex multiplication +@cindex division +@cindex remainder +@cindex quotient +@cindex exponentiation + +The @code{awk} language uses the common arithmetic operators when +evaluating expressions. All of these arithmetic operators follow normal +precedence rules, and work as you would expect them to. 
This example +divides field three by field four, adds field two, stores the result +into field one, and prints the resulting altered input record: + +@example +awk '@{ $1 = $2 + $3 / $4; print @}' inventory-shipped +@end example + +The arithmetic operators in @code{awk} are: + +@table @code +@item @var{x} + @var{y} +Addition. + +@item @var{x} - @var{y} +Subtraction. + +@item - @var{x} +Negation. + +@item + @var{x} +Unary plus. No real effect on the expression. + +@item @var{x} * @var{y} +Multiplication. + +@item @var{x} / @var{y} +Division. Since all numbers in @code{awk} are double-precision +floating point, the result is not rounded to an integer: @code{3 / 4} +has the value 0.75. + +@item @var{x} % @var{y} +@iftex +@cindex differences between @code{gawk} and @code{awk} +@end iftex +Remainder. The quotient is rounded toward zero to an integer, +multiplied by @var{y} and this result is subtracted from @var{x}. +This operation is sometimes known as ``trunc-mod.'' The following +relation always holds: + +@example +b * int(a / b) + (a % b) == a +@end example + +One possibly undesirable effect of this definition of remainder is that +@code{@var{x} % @var{y}} is negative if @var{x} is negative. Thus, + +@example +-17 % 8 = -1 +@end example + +In other @code{awk} implementations, the signedness of the remainder +may be machine dependent. + +@item @var{x} ^ @var{y} +@itemx @var{x} ** @var{y} +Exponentiation: @var{x} raised to the @var{y} power. @code{2 ^ 3} has +the value 8. The character sequence @samp{**} is equivalent to +@samp{^}. (The @sc{posix} standard only specifies the use of @samp{^} +for exponentiation.) +@end table + +@node Concatenation, Comparison Ops, Arithmetic Ops, Expressions +@section String Concatenation + +@cindex string operators +@cindex operators, string +@cindex concatenation +There is only one string operation: concatenation. It does not have a +specific operator to represent it. 
Instead, concatenation is performed by +writing expressions next to one another, with no operator. For example: + +@example +awk '@{ print "Field number one: " $1 @}' BBS-list +@end example + +@noindent +produces, for the first record in @file{BBS-list}: + +@example +Field number one: aardvark +@end example + +Without the space in the string constant after the @samp{:}, the line +would run together. For example: + +@example +awk '@{ print "Field number one:" $1 @}' BBS-list +@end example + +@noindent +produces, for the first record in @file{BBS-list}: + +@example +Field number one:aardvark +@end example + +Since string concatenation does not have an explicit operator, it is +often necessary to ensure that it happens where you want it to by +enclosing the items to be concatenated in parentheses. For example, the +following code fragment does not concatenate @code{file} and @code{name} +as you might expect: + +@example +file = "file" +name = "name" +print "something meaningful" > file name +@end example + +@noindent +It is necessary to use the following: + +@example +print "something meaningful" > (file name) +@end example + +We recommend you use parentheses around concatenation in all but the +most common contexts (such as in the right-hand operand of @samp{=}). + +@ignore +@code{gawk} actually now allows a concatenation on the right hand +side of a @code{>} redirection, but other @code{awk}s don't. So for +now we won't mention that fact. +@end ignore + +@node Comparison Ops, Boolean Ops, Concatenation, Expressions +@section Comparison Expressions +@cindex comparison expressions +@cindex expressions, comparison +@cindex relational operators +@cindex operators, relational +@cindex regexp operators + +@dfn{Comparison expressions} compare strings or numbers for +relationships such as equality. They are written using @dfn{relational +operators}, which are a superset of those in C. 
Here is a table of +them: + +@table @code +@item @var{x} < @var{y} +True if @var{x} is less than @var{y}. + +@item @var{x} <= @var{y} +True if @var{x} is less than or equal to @var{y}. + +@item @var{x} > @var{y} +True if @var{x} is greater than @var{y}. + +@item @var{x} >= @var{y} +True if @var{x} is greater than or equal to @var{y}. + +@item @var{x} == @var{y} +True if @var{x} is equal to @var{y}. + +@item @var{x} != @var{y} +True if @var{x} is not equal to @var{y}. + +@item @var{x} ~ @var{y} +True if the string @var{x} matches the regexp denoted by @var{y}. + +@item @var{x} !~ @var{y} +True if the string @var{x} does not match the regexp denoted by @var{y}. + +@item @var{subscript} in @var{array} +True if array @var{array} has an element with the subscript @var{subscript}. +@end table + +Comparison expressions have the value 1 if true and 0 if false. + +The rules @code{gawk} uses for performing comparisons are based on those +in draft 11.2 of the @sc{posix} standard. The @sc{posix} standard introduced +the concept of a @dfn{numeric string}, which is simply a string that looks +like a number, for example, @code{@w{" +2"}}. + +@vindex CONVFMT +When performing a relational operation, @code{gawk} considers the type of an +operand to be the type it received on its last @emph{assignment}, rather +than the type of its last @emph{use} +(@pxref{Values, ,Numeric and String Values}). +This type is @emph{unknown} when the operand is from an ``external'' source: +field variables, command line arguments, array elements resulting from a +@code{split} operation, and the value of an @code{ENVIRON} element. +In this case only, if the operand is a numeric string, then it is +considered to be of both string type and numeric type. If at least one +operand of a comparison is of string type only, then a string +comparison is performed. Any numeric operand will be converted to a +string using the value of @code{CONVFMT} +(@pxref{Conversion, ,Conversion of Strings and Numbers}). 
+If one operand of a comparison is numeric, and the other operand is +either numeric or both numeric and string, then @code{gawk} does a +numeric comparison. If both operands have both types, then the +comparison is numeric. Strings are compared +by comparing the first character of each, then the second character of each, +and so on. Thus @code{"10"} is less than @code{"9"}. If there are two +strings where one is a prefix of the other, the shorter string is less than +the longer one. Thus @code{"abc"} is less than @code{"abcd"}.@refill + +Here are some sample expressions, how @code{gawk} compares them, and what +the result of the comparison is. + +@table @code +@item 1.5 <= 2.0 +numeric comparison (true) + +@item "abc" >= "xyz" +string comparison (false) + +@item 1.5 != " +2" +string comparison (true) + +@item "1e2" < "3" +string comparison (true) + +@item a = 2; b = "2" +@itemx a == b +string comparison (true) +@end table + +@example +echo 1e2 3 | awk '@{ print ($1 < $2) ? "true" : "false" @}' +@end example + +@noindent +prints @samp{false}: both @code{$1} and @code{$2} are numeric +strings, so they have both string and numeric types, which dictates +a numeric comparison. + +The purpose of the comparison rules and the use of numeric strings is +to attempt to produce the behavior that is ``least surprising,'' while +still ``doing the right thing.'' + +String comparisons and regular expression comparisons are very different. +For example, + +@example +$1 == "foo" +@end example + +@noindent +has the value of 1, or is true, if the first field of the current input +record is precisely @samp{foo}. By contrast, + +@example +$1 ~ /foo/ +@end example + +@noindent +has the value 1 if the first field contains @samp{foo}, such as @samp{foobar}. 
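The difference between the two tests above can be sketched with a small shell helper. The function name @code{classify} and the awk variables @code{is_equal} and @code{has_foo} are hypothetical names chosen for this illustration; the helper prints the value of the equality test followed by the value of the regexp match for a one-word input record.

```shell
# classify WORD: print "<equality> <match>" for WORD, each as 1 or 0.
classify() {
    printf '%s\n' "$1" | awk '{
        is_equal = ($1 == "foo")   # string comparison: the whole field must be foo
        has_foo  = ($1 ~ /foo/)    # regexp match: foo may appear anywhere in the field
        print is_equal, has_foo
    }'
}

classify foo       # prints: 1 1
classify foobar    # prints: 0 1
classify bar       # prints: 0 0
```

Only the input @samp{foo} satisfies both tests; @samp{foobar} satisfies the regexp match alone, which is the case the text describes.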
+ +The right hand operand of the @samp{~} and @samp{!~} operators may be +either a constant regexp (@code{/@dots{}/}), or it may be an ordinary +expression, in which case the value of the expression as a string is a +dynamic regexp (@pxref{Regexp Usage, ,How to Use Regular Expressions}). + +@cindex regexp as expression +In very recent implementations of @code{awk}, a constant regular +expression in slashes by itself is also an expression. The regexp +@code{/@var{regexp}/} is an abbreviation for this comparison expression: + +@example +$0 ~ /@var{regexp}/ +@end example + +In some contexts it may be necessary to write parentheses around the +regexp to avoid confusing the @code{gawk} parser. For example, +@code{(/x/ - /y/) > threshold} is not allowed, but @code{((/x/) - (/y/)) +> threshold} parses properly. + +One special place where @code{/foo/} is @emph{not} an abbreviation for +@code{$0 ~ /foo/} is when it is the right-hand operand of @samp{~} or +@samp{!~}! @xref{Constants, ,Constant Expressions}, where this is +discussed in more detail. + +@node Boolean Ops, Assignment Ops, Comparison Ops, Expressions +@section Boolean Expressions +@cindex expressions, boolean +@cindex boolean expressions +@cindex operators, boolean +@cindex boolean operators +@cindex logical operations +@cindex and operator +@cindex or operator +@cindex not operator + +A @dfn{boolean expression} is a combination of comparison expressions or +matching expressions, using the boolean operators ``or'' +(@samp{||}), ``and'' (@samp{&&}), and ``not'' (@samp{!}), along with +parentheses to control nesting. The truth of the boolean expression is +computed by combining the truth values of the component expressions. + +Boolean expressions can be used wherever comparison and matching +expressions can be used. They can be used in @code{if}, @code{while}, +@code{do} and @code{for} statements. 
They have numeric values (1 if true, +0 if false), which come into play if the result of the boolean expression +is stored in a variable, or used in arithmetic.@refill + +In addition, every boolean expression is also a valid boolean pattern, so +you can use it as a pattern to control the execution of rules. + +Here are descriptions of the three boolean operators, with an example of +each. It may be instructive to compare these examples with the +analogous examples of boolean patterns +(@pxref{Boolean Patterns, ,Boolean Operators and Patterns}), which +use the same boolean operators in patterns instead of expressions.@refill + +@table @code +@item @var{boolean1} && @var{boolean2} +True if both @var{boolean1} and @var{boolean2} are true. For example, +the following statement prints the current input record if it contains +both @samp{2400} and @samp{foo}.@refill + +@smallexample +if ($0 ~ /2400/ && $0 ~ /foo/) print +@end smallexample + +The subexpression @var{boolean2} is evaluated only if @var{boolean1} +is true. This can make a difference when @var{boolean2} contains +expressions that have side effects: in the case of @code{$0 ~ /foo/ && +($2 == bar++)}, the variable @code{bar} is not incremented if there is +no @samp{foo} in the record. + +@item @var{boolean1} || @var{boolean2} +True if at least one of @var{boolean1} or @var{boolean2} is true. +For example, the following command prints all records in the input +file @file{BBS-list} that contain @emph{either} @samp{2400} or +@samp{foo}, or both.@refill + +@smallexample +awk '@{ if ($0 ~ /2400/ || $0 ~ /foo/) print @}' BBS-list +@end smallexample + +The subexpression @var{boolean2} is evaluated only if @var{boolean1} +is false. This can make a difference when @var{boolean2} contains +expressions that have side effects. + +@item !@var{boolean} +True if @var{boolean} is false. For example, the following program prints +all records in the input file @file{BBS-list} that do @emph{not} contain the +string @samp{foo}. 
+ +@smallexample +awk '@{ if (! ($0 ~ /foo/)) print @}' BBS-list +@end smallexample +@end table + +@node Assignment Ops, Increment Ops, Boolean Ops, Expressions +@section Assignment Expressions +@cindex assignment operators +@cindex operators, assignment +@cindex expressions, assignment + +An @dfn{assignment} is an expression that stores a new value into a +variable. For example, let's assign the value 1 to the variable +@code{z}:@refill + +@example +z = 1 +@end example + +After this expression is executed, the variable @code{z} has the value 1. +Whatever old value @code{z} had before the assignment is forgotten. + +Assignments can store string values also. For example, this would store +the value @code{"this food is good"} in the variable @code{message}: + +@example +thing = "food" +predicate = "good" +message = "this " thing " is " predicate +@end example + +@noindent +(This also illustrates concatenation of strings.) + +The @samp{=} sign is called an @dfn{assignment operator}. It is the +simplest assignment operator because the value of the right-hand +operand is stored unchanged. + +@cindex side effect +Most operators (addition, concatenation, and so on) have no effect +except to compute a value. If you ignore the value, you might as well +not use the operator. An assignment operator is different; it does +produce a value, but even if you ignore the value, the assignment still +makes itself felt through the alteration of the variable. We call this +a @dfn{side effect}. + +@cindex lvalue +The left-hand operand of an assignment need not be a variable +(@pxref{Variables}); it can also be a field +(@pxref{Changing Fields, ,Changing the Contents of a Field}) or +an array element (@pxref{Arrays, ,Arrays in @code{awk}}). +These are all called @dfn{lvalues}, +which means they can appear on the left-hand side of an assignment operator. 
+The right-hand operand may be any expression; it produces the new value +which the assignment stores in the specified variable, field or array +element.@refill + +It is important to note that variables do @emph{not} have permanent types. +The type of a variable is simply the type of whatever value it happens +to hold at the moment. In the following program fragment, the variable +@code{foo} has a numeric value at first, and a string value later on: + +@example +foo = 1 +print foo +foo = "bar" +print foo +@end example + +@noindent +When the second assignment gives @code{foo} a string value, the fact that +it previously had a numeric value is forgotten. + +An assignment is an expression, so it has a value: the same value that +is assigned. Thus, @code{z = 1} as an expression has the value 1. +One consequence of this is that you can write multiple assignments together: + +@example +x = y = z = 0 +@end example + +@noindent +stores the value 0 in all three variables. It does this because the +value of @code{z = 0}, which is 0, is stored into @code{y}, and then +the value of @code{y = z = 0}, which is 0, is stored into @code{x}. + +You can use an assignment anywhere an expression is called for. For +example, it is valid to write @code{x != (y = 1)} to set @code{y} to 1 +and then test whether @code{x} equals 1. But this style tends to make +programs hard to read; except in a one-shot program, you should +rewrite it to get rid of such nesting of assignments. This is never very +hard. + +Aside from @samp{=}, there are several other assignment operators that +do arithmetic with the old value of the variable. For example, the +operator @samp{+=} computes a new value by adding the right-hand value +to the old value of the variable. 
Thus, the following assignment adds +5 to the value of @code{foo}: + +@example +foo += 5 +@end example + +@noindent +This is precisely equivalent to the following: + +@example +foo = foo + 5 +@end example + +@noindent +Use whichever one makes the meaning of your program clearer. + +Here is a table of the arithmetic assignment operators. In each +case, the right-hand operand is an expression whose value is converted +to a number. + +@table @code +@item @var{lvalue} += @var{increment} +Adds @var{increment} to the value of @var{lvalue} to make the new value +of @var{lvalue}. + +@item @var{lvalue} -= @var{decrement} +Subtracts @var{decrement} from the value of @var{lvalue}. + +@item @var{lvalue} *= @var{coefficient} +Multiplies the value of @var{lvalue} by @var{coefficient}. + +@item @var{lvalue} /= @var{quotient} +Divides the value of @var{lvalue} by @var{quotient}. + +@item @var{lvalue} %= @var{modulus} +Sets @var{lvalue} to its remainder by @var{modulus}. + +@item @var{lvalue} ^= @var{power} +@itemx @var{lvalue} **= @var{power} +Raises @var{lvalue} to the power @var{power}. +(Only the @code{^=} operator is specified by @sc{posix}.) +@end table + +@ignore +From: gatech!ames!elroy!cit-vax!EQL.Caltech.Edu!rankin (Pat Rankin) + In the discussion of assignment operators, it states that +``foo += 5'' "is precisely equivalent to" ``foo = foo + 5'' (p.77). That +may be true for simple variables, but it's not true for expressions with +side effects, like array references. For proof, try + BEGIN { + foo[rand()] += 5; for (x in foo) print x, foo[x] + bar[rand()] = bar[rand()] + 5; for (x in bar) print x, bar[x] + } +I suspect that the original statement is simply untrue--that '+=' is more +efficient in all cases. + +ADR --- Try to add something about this here for the next go 'round. 
+@end ignore + +@node Increment Ops, Conversion, Assignment Ops, Expressions +@section Increment Operators + +@cindex increment operators +@cindex operators, increment +@dfn{Increment operators} increase or decrease the value of a variable +by 1. You could do the same thing with an assignment operator, so +the increment operators add no power to the @code{awk} language; but they +are convenient abbreviations for something very common. + +The operator to add 1 is written @samp{++}. It can be used to increment +a variable either before or after taking its value. + +To pre-increment a variable @var{v}, write @code{++@var{v}}. This adds +1 to the value of @var{v} and that new value is also the value of this +expression. The assignment expression @code{@var{v} += 1} is completely +equivalent. + +Writing the @samp{++} after the variable specifies post-increment. This +increments the variable value just the same; the difference is that the +value of the increment expression itself is the variable's @emph{old} +value. Thus, if @code{foo} has the value 4, then the expression @code{foo++} +has the value 4, but it changes the value of @code{foo} to 5. + +The post-increment @code{foo++} is nearly equivalent to writing @code{(foo ++= 1) - 1}. It is not perfectly equivalent because all numbers in +@code{awk} are floating point: in floating point, @code{foo + 1 - 1} does +not necessarily equal @code{foo}. But the difference is minute as +long as you stick to numbers that are fairly small (less than a trillion). + +Any lvalue can be incremented. Fields and array elements are incremented +just like variables. (Use @samp{$(i++)} when you wish to do a field reference +and a variable increment at the same time. The parentheses are necessary +because of the precedence of the field reference operator, @samp{$}.) +@c expert information in the last parenthetical remark + +The decrement operator @samp{--} works just like @samp{++} except that +it subtracts 1 instead of adding. 
Like @samp{++}, it can be used before +the lvalue to pre-decrement or after it to post-decrement. + +Here is a summary of increment and decrement expressions. + +@table @code +@item ++@var{lvalue} +This expression increments @var{lvalue} and the new value becomes the +value of this expression. + +@item @var{lvalue}++ +This expression causes the contents of @var{lvalue} to be incremented. +The value of the expression is the @emph{old} value of @var{lvalue}. + +@item --@var{lvalue} +Like @code{++@var{lvalue}}, but instead of adding, it subtracts. It +decrements @var{lvalue} and delivers the value that results. + +@item @var{lvalue}-- +Like @code{@var{lvalue}++}, but instead of adding, it subtracts. It +decrements @var{lvalue}. The value of the expression is the @emph{old} +value of @var{lvalue}. +@end table + +@node Conversion, Values, Increment Ops, Expressions +@section Conversion of Strings and Numbers + +@cindex conversion of strings and numbers +Strings are converted to numbers, and numbers to strings, if the context +of the @code{awk} program demands it. For example, if the value of +either @code{foo} or @code{bar} in the expression @code{foo + bar} +happens to be a string, it is converted to a number before the addition +is performed. If numeric values appear in string concatenation, they +are converted to strings. Consider this:@refill + +@example +two = 2; three = 3 +print (two three) + 4 +@end example + +@noindent +This eventually prints the (numeric) value 27. The numeric values of +the variables @code{two} and @code{three} are converted to strings and +concatenated together, and the resulting string is converted back to the +number 23, to which 4 is then added. + +If, for some reason, you need to force a number to be converted to a +string, concatenate the null string with that number. To force a string +to be converted to a number, add zero to that string. 
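The forced conversions described above can be observed through their effect on comparisons, as in this minimal sketch. The variable names (@code{sum}, @code{as_numbers}, @code{as_strings}, and so on) and the shell variable @code{result} are invented for the illustration.

```shell
result=$(awk 'BEGIN {
    two = 2; three = 3
    sum = (two three) + 4            # "23" becomes 23 again, then 4 is added

    a = 10; b = 9
    as_numbers = (a < b)             # numeric comparison: 10 < 9 is false
    as_strings = ((a "") < (b ""))   # forced to strings: "10" sorts before "9"

    s = "10"; t = "9"
    str_cmp = (s < t)                # string comparison: true
    num_cmp = ((s + 0) < (t + 0))    # forced to numbers: 10 < 9 is false

    print sum
    print as_numbers, as_strings
    print str_cmp, num_cmp
}')
printf '%s\n' "$result"
```

The output is @samp{27}, then @samp{0 1}, then @samp{1 0}: concatenating the null string turns a numeric comparison into a string comparison, and adding zero does the reverse.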
+ +A string is converted to a number by interpreting a numeric prefix +of the string as numerals: +@code{"2.5"} converts to 2.5, @code{"1e3"} converts to 1000, and @code{"25fix"} +has a numeric value of 25. +Strings that can't be interpreted as valid numbers are converted to +zero. + +@vindex CONVFMT +The exact manner in which numbers are converted into strings is controlled +by the @code{awk} built-in variable @code{CONVFMT} (@pxref{Built-in Variables}). +Numbers are converted using a special version of the @code{sprintf} function +(@pxref{Built-in, ,Built-in Functions}) with @code{CONVFMT} as the format +specifier.@refill + +@code{CONVFMT}'s default value is @code{"%.6g"}, which prints a value with +at most six significant digits. For some applications you will want to +change it to specify more precision. Double precision on most modern +machines gives you 16 or 17 decimal digits of precision. + +Strange results can happen if you set @code{CONVFMT} to a string that doesn't +tell @code{sprintf} how to format floating point numbers in a useful way. +For example, if you forget the @samp{%} in the format, all numbers will be +converted to the same constant string.@refill + +As a special case, if a number is an integer, then the result of converting +it to a string is @emph{always} an integer, no matter what the value of +@code{CONVFMT} may be. Given the following code fragment: + +@example +CONVFMT = "%2.2f" +a = 12 +b = a "" +@end example + +@noindent +@code{b} has the value @code{"12"}, not @code{"12.00"}. + +@ignore +For the 2.14 version, describe the ``stickiness'' of conversions. Right now +the manual assumes everywhere that variables are either numbers or strings; +in fact both kinds of values may be valid. If both happen to be valid, a +conversion isn't necessary and isn't done. Revising the manual to be +consistent with this, though, is too big a job to tackle at the moment. + +7/92: This has sort of been done, only the section isn't completely right! 
+ What to do? +7/92: Pretty much fixed, at least for the short term, thanks to text + from David. +@end ignore + +@vindex OFMT +Prior to the @sc{posix} standard, @code{awk} specified that the value +of @code{OFMT} was used for converting numbers to strings. @code{OFMT} +specifies the output format to use when printing numbers with @code{print}. +@code{CONVFMT} was introduced in order to separate the semantics of +conversions from the semantics of printing. Both @code{CONVFMT} and +@code{OFMT} have the same default value: @code{"%.6g"}. In the vast majority +of cases, old @code{awk} programs will not change their behavior. +However, this use of @code{OFMT} is something to keep in mind if you must +port your program to other implementations of @code{awk}; we recommend +that instead of changing your programs, you just port @code{gawk} itself!@refill + +@node Values, Conditional Exp, Conversion, Expressions +@section Numeric and String Values +@cindex conversion of strings and numbers + +Through most of this manual, we present @code{awk} values (such as constants, +fields, or variables) as @emph{either} numbers @emph{or} strings. This is +a convenient way to think about them, since typically they are used in only +one way, or the other. + +In truth though, @code{awk} values can be @emph{both} string and +numeric, at the same time. Internally, @code{awk} represents values +with a string, a (floating point) number, and an indication that one, +the other, or both representations of the value are valid. + +Keeping track of both kinds of values is important for execution +efficiency: a variable can acquire a string value the first time it +is used as a string, and then that string value can be used until the +variable is assigned a new value. Thus, if a variable with only a numeric +value is used in several concatenations in a row, it only has to be given +a string representation once. 
The numeric value remains valid, so that +no conversion back to a number is necessary if the variable is later used +in an arithmetic expression. + +Tracking both kinds of values is also important for precise numerical +calculations. Consider the following: + +@smallexample +a = 123.321 +CONVFMT = "%3.1f" +b = a " is a number" +c = a + 1.654 +@end smallexample + +@noindent +The variable @code{a} receives a string value in the concatenation and +assignment to @code{b}. The string value of @code{a} is @code{"123.3"}. +If the numeric value was lost when it was converted to a string, then the +numeric use of @code{a} in the last statement would lose information. +@code{c} would be assigned the value 124.954 instead of 124.975. +Such errors accumulate rapidly, and very adversely affect numeric +computations.@refill + +Once a numeric value acquires a corresponding string value, it stays valid +until a new assignment is made. If @code{CONVFMT} +(@pxref{Conversion, ,Conversion of Strings and Numbers}) changes in the +meantime, the old string value will still be used. For example:@refill + +@smallexample +BEGIN @{ + CONVFMT = "%2.2f" + a = 123.456 + b = a "" # force `a' to have string value too + printf "a = %s\n", a + CONVFMT = "%.6g" + printf "a = %s\n", a + a += 0 # make `a' numeric only again + printf "a = %s\n", a # use `a' as string +@} +@end smallexample + +@noindent +This program prints @samp{a = 123.46} twice, and then prints +@samp{a = 123.456}. + +@xref{Conversion, ,Conversion of Strings and Numbers}, for the rules that +specify how string values are made from numeric values. + +@node Conditional Exp, Function Calls, Values, Expressions +@section Conditional Expressions +@cindex conditional expression +@cindex expression, conditional + +A @dfn{conditional expression} is a special kind of expression with +three operands. It allows you to use one expression's value to select +one of two other expressions. 
+ +The conditional expression looks the same as in the C language: + +@example +@var{selector} ? @var{if-true-exp} : @var{if-false-exp} +@end example + +@noindent +There are three subexpressions. The first, @var{selector}, is always +computed first. If it is ``true'' (not zero and not null) then +@var{if-true-exp} is computed next and its value becomes the value of +the whole expression. Otherwise, @var{if-false-exp} is computed next +and its value becomes the value of the whole expression.@refill + +For example, this expression produces the absolute value of @code{x}: + +@example +x > 0 ? x : -x +@end example + +Each time the conditional expression is computed, exactly one of +@var{if-true-exp} and @var{if-false-exp} is computed; the other is ignored. +This is important when the expressions contain side effects. For example, +this conditional expression examines element @code{i} of either array +@code{a} or array @code{b}, and increments @code{i}. + +@example +x == y ? a[i++] : b[i++] +@end example + +@noindent +This is guaranteed to increment @code{i} exactly once, because each time +one or the other of the two increment expressions is executed, +and the other is not. + +@node Function Calls, Precedence, Conditional Exp, Expressions +@section Function Calls +@cindex function call +@cindex calling a function + +A @dfn{function} is a name for a particular calculation. Because it has +a name, you can ask for it by name at any point in the program. For +example, the function @code{sqrt} computes the square root of a number. + +A fixed set of functions are @dfn{built-in}, which means they are +available in every @code{awk} program. The @code{sqrt} function is one +of these. @xref{Built-in, ,Built-in Functions}, for a list of built-in +functions and their descriptions. In addition, you can define your own +functions in the program for use elsewhere in the same program. 
+@xref{User-defined, ,User-defined Functions}, for how to do this.@refill + +@cindex arguments in function call +The way to use a function is with a @dfn{function call} expression, +which consists of the function name followed by a list of +@dfn{arguments} in parentheses. The arguments are expressions which +give the raw materials for the calculation that the function will do. +When there is more than one argument, they are separated by commas. If +there are no arguments, write just @samp{()} after the function name. +Here are some examples: + +@example +sqrt(x^2 + y^2) # @r{One argument} +atan2(y, x) # @r{Two arguments} +rand() # @r{No arguments} +@end example + +@strong{Do not put any space between the function name and the +open-parenthesis!} A user-defined function name looks just like the name of +a variable, and space would make the expression look like concatenation +of a variable with an expression inside parentheses. Space before the +parenthesis is harmless with built-in functions, but it is best not to get +into the habit of using space to avoid mistakes with user-defined +functions. + +Each function expects a particular number of arguments. For example, the +@code{sqrt} function must be called with a single argument, the number +to take the square root of: + +@example +sqrt(@var{argument}) +@end example + +Some of the built-in functions allow you to omit the final argument. +If you do so, they use a reasonable default. +@xref{Built-in, ,Built-in Functions}, for full details. If arguments +are omitted in calls to user-defined functions, then those arguments are +treated as local variables, initialized to the null string +(@pxref{User-defined, ,User-defined Functions}).@refill + +Like every other expression, the function call has a value, which is +computed by the function based on the arguments you give it. In this +example, the value of @code{sqrt(@var{argument})} is the square root of the +argument. 
A function can also have side effects, such as assigning the +values of certain variables or doing I/O. + +Here is a command to read numbers, one number per line, and print the +square root of each one: + +@example +awk '@{ print "The square root of", $1, "is", sqrt($1) @}' +@end example + +@node Precedence, , Function Calls, Expressions +@section Operator Precedence (How Operators Nest) +@cindex precedence +@cindex operator precedence + +@dfn{Operator precedence} determines how operators are grouped, when +different operators appear close by in one expression. For example, +@samp{*} has higher precedence than @samp{+}; thus, @code{a + b * c} +means to multiply @code{b} and @code{c}, and then add @code{a} to the +product (i.e., @code{a + (b * c)}). + +You can overrule the precedence of the operators by using parentheses. +You can think of the precedence rules as saying where the +parentheses are assumed if you do not write parentheses yourself. In +fact, it is wise to always use parentheses whenever you have an unusual +combination of operators, because other people who read the program may +not remember what the precedence is in this case. You might forget, +too; then you could make a mistake. Explicit parentheses will help prevent +any such mistake. + +When operators of equal precedence are used together, the leftmost +operator groups first, except for the assignment, conditional and +exponentiation operators, which group in the opposite order. +Thus, @code{a - b + c} groups as @code{(a - b) + c}; +@code{a = b = c} groups as @code{a = (b = c)}.@refill + +The precedence of prefix unary operators does not matter as long as only +unary operators are involved, because there is only one way to parse +them---innermost first. Thus, @code{$++i} means @code{$(++i)} and +@code{++$x} means @code{++($x)}. However, when another operator follows +the operand, then the precedence of the unary operators can matter. 
+Thus, @code{$x^2} means @code{($x)^2}, but @code{-x^2} means +@code{-(x^2)}, because @samp{-} has lower precedence than @samp{^} +while @samp{$} has higher precedence. + +Here is a table of the operators of @code{awk}, in order of increasing +precedence: + +@table @asis +@item assignment +@samp{=}, @samp{+=}, @samp{-=}, @samp{*=}, @samp{/=}, @samp{%=}, +@samp{^=}, @samp{**=}. These operators group right-to-left. +(The @samp{**=} operator is not specified by @sc{posix}.) + +@item conditional +@samp{?:}. This operator groups right-to-left. + +@item logical ``or''. +@samp{||}. + +@item logical ``and''. +@samp{&&}. + +@item array membership +@samp{in}. + +@item matching +@samp{~}, @samp{!~}. + +@item relational, and redirection +The relational operators and the redirections have the same precedence +level. Characters such as @samp{>} serve both as relationals and as +redirections; the context distinguishes between the two meanings. + +The relational operators are @samp{<}, @samp{<=}, @samp{==}, @samp{!=}, +@samp{>=} and @samp{>}. + +The I/O redirection operators are @samp{<}, @samp{>}, @samp{>>} and +@samp{|}. + +Note that I/O redirection operators in @code{print} and @code{printf} +statements belong to the statement level, not to expressions. The +redirection does not produce an expression which could be the operand of +another operator. As a result, it does not make sense to use a +redirection operator near another operator of lower precedence, without +parentheses. Such combinations, for example @samp{print foo > a ? b : +c}, result in syntax errors. + +@item concatenation +No special token is used to indicate concatenation. +The operands are simply written side by side. + +@item add, subtract +@samp{+}, @samp{-}. + +@item multiply, divide, mod +@samp{*}, @samp{/}, @samp{%}. + +@item unary plus, minus, ``not'' +@samp{+}, @samp{-}, @samp{!}. + +@item exponentiation +@samp{^}, @samp{**}. These operators group right-to-left. 
+(The @samp{**} operator is not specified by @sc{posix}.) + +@item increment, decrement +@samp{++}, @samp{--}. + +@item field +@samp{$}. +@end table + +@node Statements, Arrays, Expressions, Top +@chapter Control Statements in Actions +@cindex control statement + +@dfn{Control statements} such as @code{if}, @code{while}, and so on +control the flow of execution in @code{awk} programs. Most of the +control statements in @code{awk} are patterned on similar statements in +C. + +All the control statements start with special keywords such as @code{if} +and @code{while}, to distinguish them from simple expressions. + +Many control statements contain other statements; for example, the +@code{if} statement contains another statement which may or may not be +executed. The contained statement is called the @dfn{body}. If you +want to include more than one statement in the body, group them into a +single compound statement with curly braces, separating them with +newlines or semicolons. + +@menu +* If Statement:: Conditionally execute + some @code{awk} statements. +* While Statement:: Loop until some condition is satisfied. +* Do Statement:: Do specified action while looping until some + condition is satisfied. +* For Statement:: Another looping statement, that provides + initialization and increment clauses. +* Break Statement:: Immediately exit the innermost enclosing loop. +* Continue Statement:: Skip to the end of the innermost + enclosing loop. +* Next Statement:: Stop processing the current input record. +* Next File Statement:: Stop processing the current file. +* Exit Statement:: Stop execution of @code{awk}. +@end menu + +@node If Statement, While Statement, Statements, Statements +@section The @code{if} Statement + +@cindex @code{if} statement +The @code{if}-@code{else} statement is @code{awk}'s decision-making +statement. 
It looks like this:@refill + +@example +if (@var{condition}) @var{then-body} @r{[}else @var{else-body}@r{]} +@end example + +@noindent +@var{condition} is an expression that controls what the rest of the +statement will do. If @var{condition} is true, @var{then-body} is +executed; otherwise, @var{else-body} is executed (assuming that the +@code{else} clause is present). The @code{else} part of the statement is +optional. The condition is considered false if its value is zero or +the null string, and true otherwise.@refill + +Here is an example: + +@example +if (x % 2 == 0) + print "x is even" +else + print "x is odd" +@end example + +In this example, if the expression @code{x % 2 == 0} is true (that is, +the value of @code{x} is divisible by 2), then the first @code{print} +statement is executed, otherwise the second @code{print} statement is +performed.@refill + +If the @code{else} appears on the same line as @var{then-body}, and +@var{then-body} is not a compound statement (i.e., not surrounded by +curly braces), then a semicolon must separate @var{then-body} from +@code{else}. To illustrate this, let's rewrite the previous example: + +@example +awk '@{ if (x % 2 == 0) print "x is even"; else + print "x is odd" @}' +@end example + +@noindent +If you forget the @samp{;}, @code{awk} won't be able to parse the +statement, and you will get a syntax error. + +We would not actually write this example this way, because a human +reader might fail to see the @code{else} if it were not the first thing +on its line. + +@node While Statement, Do Statement, If Statement, Statements +@section The @code{while} Statement +@cindex @code{while} statement +@cindex loop +@cindex body of a loop + +In programming, a @dfn{loop} means a part of a program that is (or at least can +be) executed two or more times in succession. + +The @code{while} statement is the simplest looping statement in +@code{awk}. It repeatedly executes a statement as long as a condition is +true. 
It looks like this: + +@example +while (@var{condition}) + @var{body} +@end example + +@noindent +Here @var{body} is a statement that we call the @dfn{body} of the loop, +and @var{condition} is an expression that controls how long the loop +keeps running. + +The first thing the @code{while} statement does is test @var{condition}. +If @var{condition} is true, it executes the statement @var{body}. +(@var{condition} is true when the value +is not zero and not a null string.) After @var{body} has been executed, +@var{condition} is tested again, and if it is still true, @var{body} is +executed again. This process repeats until @var{condition} is no longer +true. If @var{condition} is initially false, the body of the loop is +never executed.@refill + +This example prints the first three fields of each record, one per line. + +@example +awk '@{ i = 1 + while (i <= 3) @{ + print $i + i++ + @} +@}' +@end example + +@noindent +Here the body of the loop is a compound statement enclosed in braces, +containing two statements. + +The loop works like this: first, the value of @code{i} is set to 1. +Then, the @code{while} tests whether @code{i} is less than or equal to +three. This is the case when @code{i} equals one, so the @code{i}-th +field is printed. Then the @code{i++} increments the value of @code{i} +and the loop repeats. The loop terminates when @code{i} reaches 4. + +As you can see, a newline is not required between the condition and the +body; but using one makes the program clearer unless the body is a +compound statement or is very simple. The newline after the open-brace +that begins the compound statement is not required either, but the +program would be hard to read without it. + +@node Do Statement, For Statement, While Statement, Statements +@section The @code{do}-@code{while} Statement + +The @code{do} loop is a variation of the @code{while} looping statement. 
+The @code{do} loop executes the @var{body} once, then repeats @var{body} +as long as @var{condition} is true. It looks like this: + +@example +do + @var{body} +while (@var{condition}) +@end example + +Even if @var{condition} is false at the start, @var{body} is executed at +least once (and only once, unless executing @var{body} makes +@var{condition} true). Contrast this with the corresponding +@code{while} statement: + +@example +while (@var{condition}) + @var{body} +@end example + +@noindent +This statement does not execute @var{body} even once if @var{condition} +is false to begin with. + +Here is an example of a @code{do} statement: + +@example +awk '@{ i = 1 + do @{ + print $0 + i++ + @} while (i <= 10) +@}' +@end example + +@noindent +prints each input record ten times. It isn't a very realistic example, +since in this case an ordinary @code{while} would do just as well. But +this reflects actual experience; there is only occasionally a real use +for a @code{do} statement.@refill + +@node For Statement, Break Statement, Do Statement, Statements +@section The @code{for} Statement +@cindex @code{for} statement + +The @code{for} statement makes it more convenient to count iterations of a +loop. The general form of the @code{for} statement looks like this:@refill + +@example +for (@var{initialization}; @var{condition}; @var{increment}) + @var{body} +@end example + +@noindent +This statement starts by executing @var{initialization}. Then, as long +as @var{condition} is true, it repeatedly executes @var{body} and then +@var{increment}. Typically @var{initialization} sets a variable to +either zero or one, @var{increment} adds 1 to it, and @var{condition} +compares it against the desired number of iterations. + +Here is an example of a @code{for} statement: + +@example +@group +awk '@{ for (i = 1; i <= 3; i++) + print $i +@}' +@end group +@end example + +@noindent +This prints the first three fields of each input record, one field per +line. 
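The example above assumes that every record has at least three fields; a shorter record produces empty lines. As a minimal sketch of one variation (an assumption added for illustration, not part of the example above), the loop can be bounded by the built-in variable @code{NF}, the number of fields in the current record. The shell function name @code{print_fields} and the sample input are hypothetical.

```shell
# Hypothetical helper: print every field of each input record,
# one field per line, however many fields the record has.
print_fields() {
    awk '{ for (i = 1; i <= NF; i++)   # NF = number of fields in this record
               print $i
         }'
}

# Sample input with records of different lengths.
printf 'a b c\nd e\n' | print_fields
```

Because the condition is tested before each iteration, a record with no fields at all simply produces no output.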
+ +In the @code{for} statement, @var{body} stands for any statement, but +@var{initialization}, @var{condition} and @var{increment} are just +expressions. You cannot set more than one variable in the +@var{initialization} part unless you use a multiple assignment statement +such as @code{x = y = 0}, which is possible only if all the initial values +are equal. (But you can initialize additional variables by writing +their assignments as separate statements preceding the @code{for} loop.) + +The same is true of the @var{increment} part; to increment additional +variables, you must write separate statements at the end of the loop. +The C compound expression, using C's comma operator, would be useful in +this context, but it is not supported in @code{awk}. + +Most often, @var{increment} is an increment expression, as in the +example above. But this is not required; it can be any expression +whatever. For example, this statement prints all the powers of 2 +between 1 and 100: + +@example +for (i = 1; i <= 100; i *= 2) + print i +@end example + +Any of the three expressions in the parentheses following the @code{for} may +be omitted if there is nothing to be done there. Thus, @w{@samp{for (;x +> 0;)}} is equivalent to @w{@samp{while (x > 0)}}. 
If the +@var{condition} is omitted, it is treated as @var{true}, effectively +yielding an @dfn{infinite loop} (i.e., a loop that will never +terminate).@refill + +In most cases, a @code{for} loop is an abbreviation for a @code{while} +loop, as shown here: + +@example +@var{initialization} +while (@var{condition}) @{ + @var{body} + @var{increment} +@} +@end example + +@noindent +The only exception is when the @code{continue} statement +(@pxref{Continue Statement, ,The @code{continue} Statement}) is used +inside the loop; changing a @code{for} statement to a @code{while} +statement in this way can change the effect of the @code{continue} +statement inside the loop.@refill + +There is an alternate version of the @code{for} loop, for iterating over +all the indices of an array: + +@example +for (i in array) + @var{do something with} array[i] +@end example + +@noindent +@xref{Arrays, ,Arrays in @code{awk}}, for more information on this +version of the @code{for} loop. + +The @code{awk} language has a @code{for} statement in addition to a +@code{while} statement because often a @code{for} loop is both less work to +type and more natural to think of. Counting the number of iterations is +very common in loops. It can be easier to think of this counting as part +of looping rather than as something to do inside the loop. + +The next section has more complicated examples of @code{for} loops. + +@node Break Statement, Continue Statement, For Statement, Statements +@section The @code{break} Statement +@cindex @code{break} statement +@cindex loops, exiting + +The @code{break} statement jumps out of the innermost @code{for}, +@code{while}, or @code{do}-@code{while} loop that encloses it. 
The +following example finds the smallest divisor of any integer, and also +identifies prime numbers:@refill + +@smallexample +awk '# find smallest divisor of num + @{ num = $1 + for (div = 2; div*div <= num; div++) + if (num % div == 0) + break + if (num % div == 0) + printf "Smallest divisor of %d is %d\n", num, div + else + printf "%d is prime\n", num @}' +@end smallexample + +When the remainder is zero in the first @code{if} statement, @code{awk} +immediately @dfn{breaks out} of the containing @code{for} loop. This means +that @code{awk} proceeds immediately to the statement following the loop +and continues processing. (This is very different from the @code{exit} +statement which stops the entire @code{awk} program. +@xref{Exit Statement, ,The @code{exit} Statement}.)@refill + +Here is another program equivalent to the previous one. It illustrates how +the @var{condition} of a @code{for} or @code{while} could just as well be +replaced with a @code{break} inside an @code{if}: + +@smallexample +@group +awk '# find smallest divisor of num + @{ num = $1 + for (div = 2; ; div++) @{ + if (num % div == 0) @{ + printf "Smallest divisor of %d is %d\n", num, div + break + @} + if (div*div > num) @{ + printf "%d is prime\n", num + break + @} + @} +@}' +@end group +@end smallexample + +@node Continue Statement, Next Statement, Break Statement, Statements +@section The @code{continue} Statement + +@cindex @code{continue} statement +The @code{continue} statement, like @code{break}, is used only inside +@code{for}, @code{while}, and @code{do}-@code{while} loops. It skips +over the rest of the loop body, causing the next cycle around the loop +to begin immediately. Contrast this with @code{break}, which jumps out +of the loop altogether. 
Here is an example:@refill + +@example +# print names that don't contain the string "ignore" + +# first, save the text of each line +@{ names[NR] = $0 @} + +# print what we're interested in +END @{ + for (x in names) @{ + if (names[x] ~ /ignore/) + continue + print names[x] + @} +@} +@end example + +If one of the input records contains the string @samp{ignore}, this +example skips the print statement for that record, and continues back to +the first statement in the loop. + +This is not a practical example of @code{continue}, since it would be +just as easy to write the loop like this: + +@example +for (x in names) + if (names[x] !~ /ignore/) + print names[x] +@end example + +@ignore +from brennan@boeing.com: + +page 90, section 9.6. The example is too artificial as +the one line program + + !/ignore/ + +does the same thing. +@end ignore +@c ADR --- he's right, but don't worry about this for now + +The @code{continue} statement in a @code{for} loop directs @code{awk} to +skip the rest of the body of the loop, and resume execution with the +increment-expression of the @code{for} statement. The following program +illustrates this fact:@refill + +@example +awk 'BEGIN @{ + for (x = 0; x <= 20; x++) @{ + if (x == 5) + continue + printf ("%d ", x) + @} + print "" +@}' +@end example + +@noindent +This program prints all the numbers from 0 to 20, except for 5, for +which the @code{printf} is skipped. Since the increment @code{x++} +is not skipped, @code{x} does not remain stuck at 5. Contrast the +@code{for} loop above with the @code{while} loop: + +@example +awk 'BEGIN @{ + x = 0 + while (x <= 20) @{ + if (x == 5) + continue + printf ("%d ", x) + x++ + @} + print "" +@}' +@end example + +@noindent +This program loops forever once @code{x} gets to 5. + +As described above, the @code{continue} statement has no meaning when +used outside the body of a loop. 
However, although it was never documented, +historical implementations of @code{awk} have treated the @code{continue} +statement outside of a loop as if it were a @code{next} statement +(@pxref{Next Statement, ,The @code{next} Statement}). +By default, @code{gawk} silently supports this usage. However, if +@samp{-W posix} has been specified on the command line +(@pxref{Command Line, ,Invoking @code{awk}}), +it will be treated as an error, since the @sc{posix} standard specifies +that @code{continue} should only be used inside the body of a loop.@refill + +@node Next Statement, Next File Statement, Continue Statement, Statements +@section The @code{next} Statement +@cindex @code{next} statement + +The @code{next} statement forces @code{awk} to immediately stop processing +the current record and go on to the next record. This means that no +further rules are executed for the current record. The rest of the +current rule's action is not executed either. + +Contrast this with the effect of the @code{getline} function +(@pxref{Getline, ,Explicit Input with @code{getline}}). That too causes +@code{awk} to read the next record immediately, but it does not alter the +flow of control in any way. So the rest of the current action executes +with a new input record. + +At the highest level, @code{awk} program execution is a loop that reads +an input record and then tests each rule's pattern against it. If you +think of this loop as a @code{for} statement whose body contains the +rules, then the @code{next} statement is analogous to a @code{continue} +statement: it skips to the end of the body of this implicit loop, and +executes the increment (which reads another record). 
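The analogy with @code{continue} can be sketched with a two-rule program. This is only an illustration under assumed input; the shell function name @code{skip_comments} and the sample data are hypothetical. When the first rule executes @code{next}, the second rule is never tried for that record, just as @code{continue} skips the rest of a loop body.

```shell
# Hypothetical helper: number and print every record that is not a comment.
skip_comments() {
    awk '
    /^#/ { next }            # comment record: skip all remaining rules
    { print NR ": " $0 }     # reached only when the rule above did not fire
    '
}

printf '# header\nalpha\n# note\nbeta\n' | skip_comments
```

Note that @code{NR} still counts the skipped records, since @code{next} abandons the record only after it has been read.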
+ +For example, if your @code{awk} program works only on records with four +fields, and you don't want it to fail when given bad input, you might +use this rule near the beginning of the program: + +@smallexample +NF != 4 @{ + printf("line %d skipped: doesn't have 4 fields\n", FNR) > "/dev/stderr" + next +@} +@end smallexample + +@noindent +so that the following rules will not see the bad record. The error +message is redirected to the standard error output stream, as error +messages should be. @xref{Special Files, ,Standard I/O Streams}. + +According to the @sc{posix} standard, the behavior is undefined if +the @code{next} statement is used in a @code{BEGIN} or @code{END} rule. +@code{gawk} will treat it as a syntax error. + +If the @code{next} statement causes the end of the input to be reached, +then the code in the @code{END} rules, if any, will be executed. +@xref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}. + +@node Next File Statement, Exit Statement, Next Statement, Statements +@section The @code{next file} Statement + +@cindex @code{next file} statement +The @code{next file} statement is similar to the @code{next} statement. +However, instead of abandoning processing of the current record, the +@code{next file} statement instructs @code{awk} to stop processing the +current data file. + +Upon execution of the @code{next file} statement, @code{FILENAME} is +updated to the name of the next data file listed on the command line, +@code{FNR} is reset to 1, and processing starts over with the first +rule in the program. @xref{Built-in Variables}. + +If the @code{next file} statement causes the end of the input to be reached, +then the code in the @code{END} rules, if any, will be executed. +@xref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}. + +The @code{next file} statement is a @code{gawk} extension; it is not +(currently) available in any other @code{awk} implementation.
You can +simulate its behavior by creating a library file named @file{nextfile.awk}, +with the following contents. (This sample program uses user-defined +functions, a feature that has not been presented yet. +@xref{User-defined, ,User-defined Functions}, +for more information.)@refill + +@smallexample +# nextfile --- function to skip remaining records in current file + +# this should be read in before the "main" awk program + +function nextfile() @{ _abandon_ = FILENAME; next @} + +_abandon_ == FILENAME && FNR > 1 @{ next @} +_abandon_ == FILENAME && FNR == 1 @{ _abandon_ = "" @} +@end smallexample + +The @code{nextfile} function simply sets a ``private'' variable@footnote{Since +all variables in @code{awk} are global, this program uses the common +practice of prefixing the variable name with an underscore. In fact, it +also suffixes the variable name with an underscore, as extra insurance +against using a variable name that might be used in some other library +file.} to the name of the current data file, and then retrieves the next +record. Since this file is read before the main @code{awk} program, +the rules that follow the function definition will be executed before the +rules in the main program. The first rule continues to skip records as long as +the name of the input file has not changed, and this is not the first +record in the file. This rule is sufficient most of the time. But what if +the @emph{same} data file is named twice in a row on the command line? +On its own, the first rule would skip all but the first record of the data file the second time. The second rule +catches this case: If the data file name is what was being skipped, but +@code{FNR} is 1, then this is the second time the file is being processed, +and it should not be skipped. + +The @code{next file} statement would be useful if you have many data +files to process, and due to the nature of the data, you expect that you +would not want to process every record in the file.
In order to move on to +the next data file, you would have to continue scanning the unwanted +records (as described above). The @code{next file} statement accomplishes +this much more efficiently. + +@ignore +Would it make sense down the road to nuke `next file' in favor of +semantics that would make this work? + + function nextfile() { ARGIND++ ; next } +@end ignore + +@node Exit Statement, , Next File Statement, Statements +@section The @code{exit} Statement + +@cindex @code{exit} statement +The @code{exit} statement causes @code{awk} to immediately stop +executing the current rule and to stop processing input; any remaining input +is ignored.@refill + +If an @code{exit} statement is executed from a @code{BEGIN} rule the +program stops processing everything immediately. No input records are +read. However, if an @code{END} rule is present, it is executed +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}). + +If @code{exit} is used as part of an @code{END} rule, it causes +the program to stop immediately. + +An @code{exit} statement that is part of an ordinary rule (that is, not part +of a @code{BEGIN} or @code{END} rule) stops the execution of any further +automatic rules, but the @code{END} rule is executed if there is one. +If you do not want the @code{END} rule to do its job in this case, you +can set a variable to nonzero before the @code{exit} statement, and check +that variable in the @code{END} rule. + +If an argument is supplied to @code{exit}, its value is used as the exit +status code for the @code{awk} process. If no argument is supplied, +@code{exit} returns status zero (success).@refill + +For example, let's say you've discovered an error condition you really +don't know how to handle. Conventionally, programs report this by +exiting with a nonzero status. Your @code{awk} program can do this +using an @code{exit} statement with a nonzero argument. 
Here's an +example of this:@refill + +@example +@group +BEGIN @{ + if (("date" | getline date_now) <= 0) @{ + print "Can't get system date" > "/dev/stderr" + exit 4 + @} +@} +@end group +@end example + +@node Arrays, Built-in, Statements, Top +@chapter Arrays in @code{awk} + +An @dfn{array} is a table of values, called @dfn{elements}. The +elements of an array are distinguished by their indices. @dfn{Indices} +may be either numbers or strings. Each array has a name, which looks +like a variable name, but must not be in use as a variable name in the +same @code{awk} program. + +@menu +* Array Intro:: Introduction to Arrays +* Reference to Elements:: How to examine one element of an array. +* Assigning Elements:: How to change an element of an array. +* Array Example:: Basic Example of an Array +* Scanning an Array:: A variation of the @code{for} statement. + It loops through the indices of + an array's existing elements. +* Delete:: The @code{delete} statement removes + an element from an array. +* Numeric Array Subscripts:: How to use numbers as subscripts in @code{awk}. +* Multi-dimensional:: Emulating multi-dimensional arrays in @code{awk}. +* Multi-scanning:: Scanning multi-dimensional arrays. +@end menu + +@node Array Intro, Reference to Elements, Arrays, Arrays +@section Introduction to Arrays + +@cindex arrays +The @code{awk} language has one-dimensional @dfn{arrays} for storing groups +of related strings or numbers. + +Every @code{awk} array must have a name. Array names have the same +syntax as variable names; any valid variable name would also be a valid +array name. But you cannot use one name in both ways (as an array and +as a variable) in one @code{awk} program. + +Arrays in @code{awk} superficially resemble arrays in other programming +languages; but there are fundamental differences. In @code{awk}, you +don't need to specify the size of an array before you start to use it.
+Additionally, any number or string in @code{awk} may be used as an +array index. + +In most other languages, you have to @dfn{declare} an array and specify +how many elements or components it contains. In such languages, the +declaration causes a contiguous block of memory to be allocated for that +many elements. An index in the array must be a positive integer; for +example, the index 0 specifies the first element in the array, which is +actually stored at the beginning of the block of memory. Index 1 +specifies the second element, which is stored in memory right after the +first element, and so on. It is impossible to add more elements to the +array, because it has room for only as many elements as you declared. + +A contiguous array of four elements might look like this, +conceptually, if the element values are @code{8}, @code{"foo"}, +@code{""} and @code{30}:@refill + +@example ++---------+---------+--------+---------+ +| 8 | "foo" | "" | 30 | @r{value} ++---------+---------+--------+---------+ + 0 1 2 3 @r{index} +@end example + +@noindent +Only the values are stored; the indices are implicit from the order of +the values. @code{8} is the value at index 0, because @code{8} appears in the +position with 0 elements before it. + +@cindex arrays, definition of +@cindex associative arrays +Arrays in @code{awk} are different: they are @dfn{associative}. This means +that each array is a collection of pairs: an index, and its corresponding +array element value: + +@example +@r{Element} 4 @r{Value} 30 +@r{Element} 2 @r{Value} "foo" +@r{Element} 1 @r{Value} 8 +@r{Element} 3 @r{Value} "" +@end example + +@noindent +We have shown the pairs in jumbled order because their order is irrelevant. + +One advantage of an associative array is that new pairs can be added +at any time. For example, suppose we add to the above array a tenth element +whose value is @w{@code{"number ten"}}. 
The result is this: + +@example +@r{Element} 10 @r{Value} "number ten" +@r{Element} 4 @r{Value} 30 +@r{Element} 2 @r{Value} "foo" +@r{Element} 1 @r{Value} 8 +@r{Element} 3 @r{Value} "" +@end example + +@noindent +Now the array is @dfn{sparse} (i.e., some indices are missing): it has +elements 1--4 and 10, but doesn't have elements 5, 6, 7, 8, or 9.@refill + +Another consequence of associative arrays is that the indices don't +have to be positive integers. Any number, or even a string, can be +an index. For example, here is an array which translates words from +English into French: + +@example +@r{Element} "dog" @r{Value} "chien" +@r{Element} "cat" @r{Value} "chat" +@r{Element} "one" @r{Value} "un" +@r{Element} 1 @r{Value} "un" +@end example + +@noindent +Here we decided to translate the number 1 in both spelled-out and +numeric form---thus illustrating that a single array can have both +numbers and strings as indices. + +When @code{awk} creates an array for you, e.g., with the @code{split} +built-in function, +that array's indices are consecutive integers starting at 1. +(@xref{String Functions, ,Built-in Functions for String Manipulation}.) + +@node Reference to Elements, Assigning Elements, Array Intro, Arrays +@section Referring to an Array Element +@cindex array reference +@cindex element of array +@cindex reference to array + +The principal way of using an array is to refer to one of its elements. +An array reference is an expression which looks like this: + +@example +@var{array}[@var{index}] +@end example + +@noindent +Here, @var{array} is the name of an array. The expression @var{index} is +the index of the element of the array that you want. + +The value of the array reference is the current value of that array +element. For example, @code{foo[4.3]} is an expression for the element +of array @code{foo} at index 4.3. + +If you refer to an array element that has no recorded value, the value +of the reference is @code{""}, the null string. 
This includes elements +to which you have not assigned any value, and elements that have been +deleted (@pxref{Delete, ,The @code{delete} Statement}). Such a reference +automatically creates that array element, with the null string as its value. +(In some cases, this is unfortunate, because it might waste memory inside +@code{awk}). + +@cindex arrays, presence of elements +You can find out if an element exists in an array at a certain index with +the expression: + +@example +@var{index} in @var{array} +@end example + +@noindent +This expression tests whether or not the particular index exists, +without the side effect of creating that element if it is not present. +The expression has the value 1 (true) if @code{@var{array}[@var{index}]} +exists, and 0 (false) if it does not exist.@refill + +For example, to test whether the array @code{frequencies} contains the +index @code{"2"}, you could write this statement:@refill + +@smallexample +if ("2" in frequencies) print "Subscript \"2\" is present." +@end smallexample + +Note that this is @emph{not} a test of whether or not the array +@code{frequencies} contains an element whose @emph{value} is @code{"2"}. +(There is no way to do that except to scan all the elements.) Also, this +@emph{does not} create @code{frequencies["2"]}, while the following +(incorrect) alternative would do so:@refill + +@smallexample +if (frequencies["2"] != "") print "Subscript \"2\" is present." +@end smallexample + +@node Assigning Elements, Array Example, Reference to Elements, Arrays +@section Assigning Array Elements +@cindex array assignment +@cindex element assignment + +Array elements are lvalues: they can be assigned values just like +@code{awk} variables: + +@example +@var{array}[@var{subscript}] = @var{value} +@end example + +@noindent +Here @var{array} is the name of your array. The expression +@var{subscript} is the index of the element of the array that you want +to assign a value. 
The expression @var{value} is the value you are +assigning to that element of the array.@refill + +@node Array Example, Scanning an Array, Assigning Elements, Arrays +@section Basic Example of an Array + +The following program takes a list of lines, each beginning with a line +number, and prints them out in order of line number. The line numbers are +not in order, however, when they are first read: they are scrambled. This +program sorts the lines by making an array using the line numbers as +subscripts. It then prints out the lines in sorted order of their numbers. +It is a very simple program, and gets confused if it encounters repeated +numbers, gaps, or lines that don't begin with a number.@refill + +@example +@{ + if ($1 > max) + max = $1 + arr[$1] = $0 +@} + +END @{ + for (x = 1; x <= max; x++) + print arr[x] +@} +@end example + +The first rule keeps track of the largest line number seen so far; +it also stores each line into the array @code{arr}, at an index that +is the line's number. + +The second rule runs after all the input has been read, to print out +all the lines. + +When this program is run with the following input: + +@example +5 I am the Five man +2 Who are you? The new number two! +4 . . . And four on the floor +1 Who is number one? +3 I three you. +@end example + +@noindent +its output is this: + +@example +1 Who is number one? +2 Who are you? The new number two! +3 I three you. +4 . . . And four on the floor +5 I am the Five man +@end example + +If a line number is repeated, the last line with a given number overrides +the others. 
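The override behavior just described can be checked with a small sketch. The shell function name @code{sort_by_number} and the sample input are made up for illustration; only the @code{awk} program inside it comes from the text above.

```shell
# sort_by_number: hypothetical wrapper around the sorting program shown above.
sort_by_number() {
    awk '
    {
        if ($1 > max)
            max = $1
        arr[$1] = $0        # a repeated number overwrites the earlier line
    }
    END {
        for (x = 1; x <= max; x++)
            print arr[x]
    }'
}

# Line number 2 appears twice; only the later of the two lines is kept.
printf '2 first version\n1 one\n2 second version\n' | sort_by_number
```

With this input, the program should print the line numbered 1 followed by the second of the two lines numbered 2; the first line numbered 2 is silently lost.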
+ +Gaps in the line numbers can be handled with an easy improvement to the +program's @code{END} rule: + +@example +END @{ + for (x = 1; x <= max; x++) + if (x in arr) + print arr[x] +@} +@end example + +@node Scanning an Array, Delete, Array Example, Arrays +@section Scanning all Elements of an Array +@cindex @code{for (x in @dots{})} +@cindex arrays, special @code{for} statement +@cindex scanning an array + +In programs that use arrays, often you need a loop that executes +once for each element of an array. In other languages, where arrays are +contiguous and indices are limited to nonnegative integers, this is +easy: the largest index is one less than the length of the array, and you can +find all the valid indices by counting from zero up to that value. This +technique won't do the job in @code{awk}, since any number or string +may be an array index. So @code{awk} has a special kind of @code{for} +statement for scanning an array: + +@example +for (@var{var} in @var{array}) + @var{body} +@end example + +@noindent +This loop executes @var{body} once for each different value that your +program has previously used as an index in @var{array}, with the +variable @var{var} set to that index.@refill + +Here is a program that uses this form of the @code{for} statement. The +first rule scans the input records and notes which words appear (at +least once) in the input, by storing a 1 into the array @code{used} with +the word as index. The second rule scans the elements of @code{used} to +find all the distinct words that appear in the input. It prints each +word that is more than 10 characters long, and also prints the number of +such words. @xref{Built-in, ,Built-in Functions}, for more information +on the built-in function @code{length}. + +@smallexample +# Record a 1 for each word that is used at least once. +@{ + for (i = 1; i <= NF; i++) + used[$i] = 1 +@} + +# Find number of distinct words more than 10 characters long. 
+END @{ + for (x in used) + if (length(x) > 10) @{ + ++num_long_words + print x + @} + print num_long_words, "words longer than 10 characters" +@} +@end smallexample + +@noindent +@xref{Sample Program}, for a more detailed example of this type. + +The order in which elements of the array are accessed by this statement +is determined by the internal arrangement of the array elements within +@code{awk} and cannot be controlled or changed. This can lead to +problems if new elements are added to @var{array} by statements in +@var{body}; you cannot predict whether or not the @code{for} loop will +reach them. Similarly, changing @var{var} inside the loop can produce +strange results. It is best to avoid such things.@refill + +@node Delete, Numeric Array Subscripts, Scanning an Array, Arrays +@section The @code{delete} Statement +@cindex @code{delete} statement +@cindex deleting elements of arrays +@cindex removing elements of arrays +@cindex arrays, deleting an element + +You can remove an individual element of an array using the @code{delete} +statement: + +@example +delete @var{array}[@var{index}] +@end example + +You can not refer to an array element after it has been deleted; +it is as if you had never referred +to it and had never given it any value. You can no longer obtain any +value the element once had. + +Here is an example of deleting elements in an array: + +@example +for (i in frequencies) + delete frequencies[i] +@end example + +@noindent +This example removes all the elements from the array @code{frequencies}. + +If you delete an element, a subsequent @code{for} statement to scan the array +will not report that element, and the @code{in} operator to check for +the presence of that element will return 0: + +@example +delete foo[4] +if (4 in foo) + print "This will never be printed" +@end example + +It is not an error to delete an element which does not exist. 
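The points in this section can be tried out with a short sketch. The array name @code{foo} follows the fragment above; the shell function name @code{delete_demo} and the index 99 are arbitrary choices made for illustration.

```shell
# delete_demo: hypothetical helper that exercises the delete statement.
delete_demo() {
    awk 'BEGIN {
        foo[4] = "four"
        delete foo[4]       # remove the only element
        delete foo[99]      # index 99 was never used; this is not an error
        if (4 in foo)
            print "still present"
        else
            print "absent"
        n = 0
        for (i in foo)      # the scan no longer reports index 4
            n++
        print n
    }'
}

delete_demo
```

The sketch should report that index 4 is absent and that the scan finds no elements, which matches the description of @code{delete} given above.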
+ +@node Numeric Array Subscripts, Multi-dimensional, Delete, Arrays +@section Using Numbers to Subscript Arrays + +An important aspect of arrays to remember is that array subscripts +are @emph{always} strings. If you use a numeric value as a subscript, +it will be converted to a string value before it is used for subscripting +(@pxref{Conversion, ,Conversion of Strings and Numbers}). + +@cindex conversions, during subscripting +@cindex numbers, used as subscripts +@vindex CONVFMT +This means that the value of @code{CONVFMT} can potentially +affect how your program accesses elements of an array. For example: + +@example +a = b = 12.153 +data[a] = 1 +CONVFMT = "%2.2f" +if (b in data) + printf "%s is in data", b +else + printf "%s is not in data", b +@end example + +@noindent +should print @samp{12.15 is not in data}. The first statement gives +both @code{a} and @code{b} the same numeric value. Assigning to +@code{data[a]} first gives @code{a} the string value @code{"12.153"} +(using the default conversion value of @code{CONVFMT}, @code{"%.6g"}), +and then assigns 1 to @code{data["12.153"]}. The program then changes +the value of @code{CONVFMT}. The test @samp{(b in data)} forces @code{b} +to be converted to a string, this time @code{"12.15"}, since the value of +@code{CONVFMT} allows only two digits after the decimal point. This test fails, +since @code{"12.15"} is a different string from @code{"12.153"}.@refill + +According to the rules for conversions +(@pxref{Conversion, ,Conversion of Strings and Numbers}), integer +values are always converted to strings as integers, no matter what the +value of @code{CONVFMT} may happen to be. So the usual case of@refill + +@example +for (i = 1; i <= maxsub; i++) + @i{do something with} array[i] +@end example + +@noindent +will work, no matter what the value of @code{CONVFMT}. + +Like many things in @code{awk}, the majority of the time things work +as you would expect them to work. 
But it is useful to have a precise +knowledge of the actual rules, since sometimes they can have a subtle +effect on your programs. + +@node Multi-dimensional, Multi-scanning, Numeric Array Subscripts, Arrays +@section Multi-dimensional Arrays + +@c the following index entry is an overfull hbox. --mew 30jan1992 +@cindex subscripts in arrays +@cindex arrays, multi-dimensional subscripts +@cindex multi-dimensional subscripts +A multi-dimensional array is an array in which an element is identified +by a sequence of indices, not a single index. For example, a +two-dimensional array requires two indices. The usual way (in most +languages, including @code{awk}) to refer to an element of a +two-dimensional array named @code{grid} is with +@code{grid[@var{x},@var{y}]}. + +@vindex SUBSEP +Multi-dimensional arrays are supported in @code{awk} through +concatenation of indices into one string. What happens is that +@code{awk} converts the indices into strings +(@pxref{Conversion, ,Conversion of Strings and Numbers}) and +concatenates them together, with a separator between them. This creates +a single string that describes the values of the separate indices. The +combined string is used as a single index into an ordinary, +one-dimensional array. The separator used is the value of the built-in +variable @code{SUBSEP}.@refill + +For example, suppose we evaluate the expression @code{foo[5,12]="value"} +when the value of @code{SUBSEP} is @code{"@@"}. The numbers 5 and 12 are +converted to strings and +concatenated with an @samp{@@} between them, yielding @code{"5@@12"}; thus, +the array element @code{foo["5@@12"]} is set to @code{"value"}.@refill + +Once the element's value is stored, @code{awk} has no record of whether +it was stored with a single index or a sequence of indices. The two +expressions @code{foo[5,12]} and @w{@code{foo[5 SUBSEP 12]}} always have +the same value. 
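The equivalence just described can be seen in a minimal sketch. It assumes a visible separator, @code{":"}, chosen only so that the combined index is easy to read; the shell function name @code{subsep_demo} is likewise made up for illustration.

```shell
# subsep_demo: hypothetical helper showing how a subscript list is stored.
subsep_demo() {
    awk 'BEGIN {
        SUBSEP = ":"                 # arbitrary visible separator for the demo
        foo[5,12] = "value"
        print foo[5 SUBSEP 12]       # same element, explicit concatenation
        print foo["5:12"]            # same element, literal combined index
        for (k in foo)
            print k                  # the stored index is the combined string
    }'
}

subsep_demo
```

All three ways of naming the element reach the same stored pair, and the scan shows that the array holds only the single combined index.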
+ +The default value of @code{SUBSEP} is the string @code{"\034"}, +which contains a nonprinting character that is unlikely to appear in an +@code{awk} program or in the input data. + +The usefulness of choosing an unlikely character comes from the fact +that index values that contain a string matching @code{SUBSEP} lead to +combined strings that are ambiguous. Suppose that @code{SUBSEP} were +@code{"@@"}; then @w{@code{foo["a@@b", "c"]}} and @w{@code{foo["a", +"b@@c"]}} would be indistinguishable because both would actually be +stored as @code{foo["a@@b@@c"]}. Because @code{SUBSEP} is +@code{"\034"}, such confusion can arise only when an index +contains the character with ASCII code 034, which is a rare +event.@refill + +You can test whether a particular index-sequence exists in a +``multi-dimensional'' array with the same operator @code{in} used for single +dimensional arrays. Instead of a single index as the left-hand operand, +write the whole sequence of indices, separated by commas, in +parentheses:@refill + +@example +(@var{subscript1}, @var{subscript2}, @dots{}) in @var{array} +@end example + +The following example treats its input as a two-dimensional array of +fields; it rotates this array 90 degrees clockwise and prints the +result. It assumes that all lines have the same number of +elements. 
+ +@example +awk '@{ + if (max_nf < NF) + max_nf = NF + max_nr = NR + for (x = 1; x <= NF; x++) + vector[x, NR] = $x +@} + +END @{ + for (x = 1; x <= max_nf; x++) @{ + for (y = max_nr; y >= 1; --y) + printf("%s ", vector[x, y]) + printf("\n") + @} +@}' +@end example + +@noindent +When given the input: + +@example +@group +1 2 3 4 5 6 +2 3 4 5 6 1 +3 4 5 6 1 2 +4 5 6 1 2 3 +@end group +@end example + +@noindent +it produces: + +@example +@group +4 3 2 1 +5 4 3 2 +6 5 4 3 +1 6 5 4 +2 1 6 5 +3 2 1 6 +@end group +@end example + +@node Multi-scanning, , Multi-dimensional, Arrays +@section Scanning Multi-dimensional Arrays + +There is no special @code{for} statement for scanning a +``multi-dimensional'' array; there cannot be one, because in truth there +are no multi-dimensional arrays or elements; there is only a +multi-dimensional @emph{way of accessing} an array. + +However, if your program has an array that is always accessed as +multi-dimensional, you can get the effect of scanning it by combining +the scanning @code{for} statement +(@pxref{Scanning an Array, ,Scanning all Elements of an Array}) with the +@code{split} built-in function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +It works like this:@refill + +@example +for (combined in @var{array}) @{ + split(combined, separate, SUBSEP) + @dots{} +@} +@end example + +@noindent +This finds each concatenated, combined index in the array, and splits it +into the individual indices by breaking it apart where the value of +@code{SUBSEP} appears. The split-out indices become the elements of +the array @code{separate}. + +Thus, suppose you have previously stored in @code{@var{array}[1, +"foo"]}; then an element with index @code{"1\034foo"} exists in +@var{array}. (Recall that the default value of @code{SUBSEP} contains +the character with code 034.) Sooner or later the @code{for} statement +will find that index and do an iteration with @code{combined} set to +@code{"1\034foo"}. 
Then the @code{split} function is called as +follows: + +@example +split("1\034foo", separate, "\034") +@end example + +@noindent +The result of this is to set @code{separate[1]} to 1 and @code{separate[2]} +to @code{"foo"}. Presto, the original sequence of separate indices has +been recovered. + +@node Built-in, User-defined, Arrays, Top +@chapter Built-in Functions + +@cindex built-in functions +@dfn{Built-in} functions are functions that are always available for +your @code{awk} program to call. This chapter defines all the built-in +functions in @code{awk}; some of them are mentioned in other sections, +but they are summarized here for your convenience. (You can also define +new functions yourself. @xref{User-defined, ,User-defined Functions}.) + +@menu +* Calling Built-in:: How to call built-in functions. +* Numeric Functions:: Functions that work with numbers, + including @code{int}, @code{sin} and @code{rand}. +* String Functions:: Functions for string manipulation, + such as @code{split}, @code{match}, and @code{sprintf}. +* I/O Functions:: Functions for files and shell commands. +* Time Functions:: Functions for dealing with time stamps. +@end menu + +@node Calling Built-in, Numeric Functions, Built-in, Built-in +@section Calling Built-in Functions + +To call a built-in function, write the name of the function followed +by arguments in parentheses. For example, @code{atan2(y + z, 1)} +is a call to the function @code{atan2}, with two arguments. + +Whitespace is ignored between the built-in function name and the +open-parenthesis, but we recommend that you avoid using whitespace +there. User-defined functions do not permit whitespace in this way, and +you will find it easier to avoid mistakes by following a simple +convention which always works: no whitespace after a function name. + +Each built-in function accepts a certain number of arguments. In most +cases, any extra arguments given to built-in functions are ignored. 
The +defaults for omitted arguments vary from function to function and are +described under the individual functions. + +When a function is called, expressions that create the function's actual +parameters are evaluated completely before the function call is performed. +For example, in the code fragment: + +@example +i = 4 +j = sqrt(i++) +@end example + +@noindent +the variable @code{i} is set to 5 before @code{sqrt} is called +with a value of 4 for its actual parameter. + +@node Numeric Functions, String Functions, Calling Built-in, Built-in +@section Numeric Built-in Functions +@c I didn't make all the examples small because a couple of them were +@c short already. --mew 29jan1992 + +Here is a full list of built-in functions that work with numbers: + +@table @code +@item int(@var{x}) +This gives you the integer part of @var{x}, truncated toward 0. This +produces the nearest integer to @var{x}, located between @var{x} and 0. + +For example, @code{int(3)} is 3, @code{int(3.9)} is 3, @code{int(-3.9)} +is @minus{}3, and @code{int(-3)} is @minus{}3 as well.@refill + +@item sqrt(@var{x}) +This gives you the positive square root of @var{x}. It reports an error +if @var{x} is negative. Thus, @code{sqrt(4)} is 2.@refill + +@item exp(@var{x}) +This gives you the exponential of @var{x}, or reports an error if +@var{x} is out of range. The range of values @var{x} can have depends +on your machine's floating point representation.@refill + +@item log(@var{x}) +This gives you the natural logarithm of @var{x}, if @var{x} is positive; +otherwise, it reports an error.@refill + +@item sin(@var{x}) +This gives you the sine of @var{x}, with @var{x} in radians. + +@item cos(@var{x}) +This gives you the cosine of @var{x}, with @var{x} in radians. + +@item atan2(@var{y}, @var{x}) +This gives you the arctangent of @code{@var{y} / @var{x}} in radians. + +@item rand() +This gives you a random number. The values of @code{rand} are +uniformly-distributed between 0 and 1. 
The value may be 0 but is never +1. + +Often you want random integers instead. Here is a user-defined function +you can use to obtain a random nonnegative integer less than @var{n}: + +@example +function randint(n) @{ + return int(n * rand()) +@} +@end example + +@noindent +The multiplication produces a random real number greater than or equal to 0 and less +than @var{n}. We then make it an integer (using @code{int}) between 0 +and @code{@var{n} @minus{} 1}. + +Here is an example where a similar function is used to produce +random integers between 1 and @var{n}. Note that this program will +print a new random number for each input record. + +@smallexample +awk ' +# Function to roll a simulated die. +function roll(n) @{ return 1 + int(rand() * n) @} + +# Roll 3 six-sided dice and print total number of points. +@{ + printf("%d points\n", roll(6)+roll(6)+roll(6)) +@}' +@end smallexample + +@strong{Note:} @code{rand} starts generating numbers from the same +point, or @dfn{seed}, each time you run @code{awk}. This means that +a program will produce the same results each time you run it. +The numbers are random within one @code{awk} run, but predictable +from run to run. This is convenient for debugging, but if you want +a program to do different things each time it is used, you must change +the seed to a value that will be different in each run. To do this, +use @code{srand}. + +@item srand(@var{x}) +The function @code{srand} sets the starting point, or @dfn{seed}, +for generating random numbers to the value @var{x}. + +Each seed value leads to a particular sequence of ``random'' numbers. +Thus, if you set the seed to the same value a second time, you will get +the same sequence of ``random'' numbers again. + +If you omit the argument @var{x}, as in @code{srand()}, then the current +date and time of day are used for a seed. This is the way to get random +numbers that differ from one run to the next. + +The return value of @code{srand} is the previous seed. 
This makes it +easy to keep track of the seeds for use in consistently reproducing +sequences of random numbers. +@end table + +@node String Functions, I/O Functions, Numeric Functions, Built-in +@section Built-in Functions for String Manipulation + +The functions in this section look at or change the text of one or more +strings. + +@table @code +@item index(@var{in}, @var{find}) +@findex index +This searches the string @var{in} for the first occurrence of the string +@var{find}, and returns the position in characters where that occurrence +begins in the string @var{in}. For example:@refill + +@smallexample +awk 'BEGIN @{ print index("peanut", "an") @}' +@end smallexample + +@noindent +prints @samp{3}. If @var{find} is not found, @code{index} returns 0. +(Remember that string indices in @code{awk} start at 1.) + +@item length(@var{string}) +@findex length +This gives you the number of characters in @var{string}. If +@var{string} is a number, the length of the digit string representing +that number is returned. For example, @code{length("abcde")} is 5. By +contrast, @code{length(15 * 35)} works out to 3. How? Well, 15 * 35 = +525, and 525 is then converted to the string @samp{"525"}, which has +three characters. + +If no argument is supplied, @code{length} returns the length of @code{$0}. + +In older versions of @code{awk}, you could call the @code{length} function +without any parentheses. Doing so is marked as ``deprecated'' in the +@sc{posix} standard. This means that while you can do this in your +programs, it is a feature that can eventually be removed from a future +version of the standard. Therefore, for maximal portability of your +@code{awk} programs you should always supply the parentheses. + +@item match(@var{string}, @var{regexp}) +@findex match +The @code{match} function searches the string, @var{string}, for the +longest, leftmost substring matched by the regular expression, +@var{regexp}. 
It returns the character position, or @dfn{index}, of +where that substring begins (1, if it starts at the beginning of +@var{string}). If no match is found, it returns 0. + +@vindex RSTART +@vindex RLENGTH +The @code{match} function sets the built-in variable @code{RSTART} to +the index. It also sets the built-in variable @code{RLENGTH} to the +length in characters of the matched substring. If no match is found, +@code{RSTART} is set to 0, and @code{RLENGTH} to @minus{}1. + +For example: + +@smallexample +awk '@{ + if ($1 == "FIND") + regex = $2 + else @{ + where = match($0, regex) + if (where) + print "Match of", regex, "found at", where, "in", $0 + @} +@}' +@end smallexample + +@noindent +This program looks for lines that match the regular expression stored in +the variable @code{regex}. This regular expression can be changed. If the +first word on a line is @samp{FIND}, @code{regex} is changed to be the +second word on that line. Therefore, given: + +@smallexample +FIND fo*bar +My program was a foobar +But none of it would doobar +FIND Melvin +JF+KM +This line is property of The Reality Engineering Co. +This file created by Melvin. +@end smallexample + +@noindent +@code{awk} prints: + +@smallexample +Match of fo*bar found at 18 in My program was a foobar +Match of Melvin found at 22 in This file created by Melvin. +@end smallexample + +@item split(@var{string}, @var{array}, @var{fieldsep}) +@findex split +This divides @var{string} into pieces separated by @var{fieldsep}, +and stores the pieces in @var{array}. The first piece is stored in +@code{@var{array}[1]}, the second piece in @code{@var{array}[2]}, and so +forth. The string value of the third argument, @var{fieldsep}, is +a regexp describing where to split @var{string} (much as @code{FS} can +be a regexp describing where to split input records). If +@var{fieldsep} is omitted, the value of @code{FS} is used. 
+@code{split} returns the number of elements created.@refill + +The @code{split} function, then, splits strings into pieces in a +manner similar to the way input lines are split into fields. For example: + +@smallexample +split("auto-da-fe", a, "-") +@end smallexample + +@noindent +splits the string @samp{auto-da-fe} into three fields using @samp{-} as the +separator. It sets the contents of the array @code{a} as follows: + +@smallexample +a[1] = "auto" +a[2] = "da" +a[3] = "fe" +@end smallexample + +@noindent +The value returned by this call to @code{split} is 3. + +As with input field-splitting, when the value of @var{fieldsep} is +@code{" "}, leading and trailing whitespace is ignored, and the elements +are separated by runs of whitespace. + +@item sprintf(@var{format}, @var{expression1},@dots{}) +@findex sprintf +This returns (without printing) the string that @code{printf} would +have printed out with the same arguments +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}). +For example:@refill + +@smallexample +sprintf("pi = %.2f (approx.)", 22/7) +@end smallexample + +@noindent +returns the string @w{@code{"pi = 3.14 (approx.)"}}. + +@item sub(@var{regexp}, @var{replacement}, @var{target}) +@findex sub +The @code{sub} function alters the value of @var{target}. +It searches this value, which should be a string, for the +leftmost substring matched by the regular expression, @var{regexp}, +extending this match as far as possible. Then the entire string is +changed by replacing the matched text with @var{replacement}. +The modified string becomes the new value of @var{target}. + +This function is peculiar because @var{target} is not simply +used to compute a value, and not just any expression will do: it +must be a variable, field or array reference, so that @code{sub} can +store a modified value there. If this argument is omitted, then the +default is to use and alter @code{$0}. 
+ +For example:@refill + +@smallexample +str = "water, water, everywhere" +sub(/at/, "ith", str) +@end smallexample + +@noindent +sets @code{str} to @w{@code{"wither, water, everywhere"}}, by replacing the +leftmost, longest occurrence of @samp{at} with @samp{ith}. + +The @code{sub} function returns the number of substitutions made (either +one or zero). + +If the special character @samp{&} appears in @var{replacement}, it +stands for the precise substring that was matched by @var{regexp}. (If +the regexp can match more than one string, then this precise substring +may vary.) For example:@refill + +@smallexample +awk '@{ sub(/candidate/, "& and his wife"); print @}' +@end smallexample + +@noindent +changes the first occurrence of @samp{candidate} to @samp{candidate +and his wife} on each input line. + +Here is another example: + +@smallexample +awk 'BEGIN @{ + str = "daabaaa" + sub(/a+/, "c&c", str) + print str +@}' +@end smallexample + +@noindent +prints @samp{dcaacbaaa}. This shows how @samp{&} can represent a non-constant +string, and also illustrates the ``leftmost, longest'' rule. + +The effect of this special character (@samp{&}) can be turned off by putting a +backslash before it in the string. As usual, to insert one backslash in +the string, you must write two backslashes. Therefore, write @samp{\\&} +in a string constant to include a literal @samp{&} in the replacement. +For example, here is how to replace the first @samp{|} on each line with +an @samp{&}:@refill + +@smallexample +awk '@{ sub(/\|/, "\\&"); print @}' +@end smallexample + +@strong{Note:} as mentioned above, the third argument to @code{sub} must +be an lvalue. Some versions of @code{awk} allow the third argument to +be an expression which is not an lvalue. In such a case, @code{sub} +would still search for the pattern and return 0 or 1, but the result of +the substitution (if any) would be thrown away because there is no place +to put it. 
Such versions of @code{awk} accept expressions like +this:@refill + +@smallexample +sub(/USA/, "United States", "the USA and Canada") +@end smallexample + +@noindent +But that is considered erroneous in @code{gawk}. + +@item gsub(@var{regexp}, @var{replacement}, @var{target}) +@findex gsub +This is similar to the @code{sub} function, except @code{gsub} replaces +@emph{all} of the longest, leftmost, @emph{nonoverlapping} matching +substrings it can find. The @samp{g} in @code{gsub} stands for +``global,'' which means replace everywhere. For example:@refill + +@smallexample +awk '@{ gsub(/Britain/, "United Kingdom"); print @}' +@end smallexample + +@noindent +replaces all occurrences of the string @samp{Britain} with @samp{United +Kingdom} for all input records.@refill + +The @code{gsub} function returns the number of substitutions made. If +the variable to be searched and altered, @var{target}, is +omitted, then the entire input record, @code{$0}, is used.@refill + +As in @code{sub}, the characters @samp{&} and @samp{\} are special, and +the third argument must be an lvalue. + +@item substr(@var{string}, @var{start}, @var{length}) +@findex substr +This returns a @var{length}-character-long substring of @var{string}, +starting at character number @var{start}. The first character of a +string is character number one. For example, +@code{substr("washington", 5, 3)} returns @code{"ing"}.@refill + +If @var{length} is not present, this function returns the whole suffix of +@var{string} that begins at character number @var{start}. For example, +@code{substr("washington", 5)} returns @code{"ington"}. This is also +the case if @var{length} is greater than the number of characters remaining +in the string, counting from character number @var{start}. + +@item tolower(@var{string}) +@findex tolower +This returns a copy of @var{string}, with each upper-case character +in the string replaced with its corresponding lower-case character. 
+Nonalphabetic characters are left unchanged. For example, +@code{tolower("MiXeD cAsE 123")} returns @code{"mixed case 123"}. + +@item toupper(@var{string}) +@findex toupper +This returns a copy of @var{string}, with each lower-case character +in the string replaced with its corresponding upper-case character. +Nonalphabetic characters are left unchanged. For example, +@code{toupper("MiXeD cAsE 123")} returns @code{"MIXED CASE 123"}. +@end table + +@node I/O Functions, Time Functions, String Functions, Built-in +@section Built-in Functions for Input/Output + +@table @code +@item close(@var{filename}) +Close the file @var{filename}, for input or output. The argument may +alternatively be a shell command that was used for redirecting to or +from a pipe; then the pipe is closed. + +@xref{Close Input, ,Closing Input Files and Pipes}, regarding closing +input files and pipes. @xref{Close Output, ,Closing Output Files and Pipes}, +regarding closing output files and pipes.@refill + +@item system(@var{command}) +@findex system +@c the following index entry is an overfull hbox. --mew 30jan1992 +@cindex interaction, @code{awk} and other programs +The system function allows the user to execute operating system commands +and then return to the @code{awk} program. The @code{system} function +executes the command given by the string @var{command}. It returns, as +its value, the status returned by the command that was executed. + +For example, if the following fragment of code is put in your @code{awk} +program: + +@smallexample +END @{ + system("mail -s 'awk run done' operator < /dev/null") +@} +@end smallexample + +@noindent +the system operator will be sent mail when the @code{awk} program +finishes processing input and begins its end-of-input processing. + +Note that much the same result can be obtained by redirecting +@code{print} or @code{printf} into a pipe. 
However, if your @code{awk} +program is interactive, @code{system} is useful for cranking up large +self-contained programs, such as a shell or an editor.@refill + +Some operating systems cannot implement the @code{system} function. +@code{system} causes a fatal error if it is not supported. +@end table + +@c fakenode --- for prepinfo +@subheading Controlling Output Buffering with @code{system} +@cindex flushing buffers +@cindex buffers, flushing +@cindex buffering output +@cindex output, buffering + +Many utility programs will @dfn{buffer} their output; they save information +to be written to a disk file or terminal in memory, until there is enough +to be written in one operation. This is often more efficient than writing +every little bit of information as soon as it is ready. However, sometimes +it is necessary to force a program to @dfn{flush} its buffers; that is, +write the information to its destination, even if a buffer is not full. +You can do this from your @code{awk} program by calling @code{system} +with a null string as its argument: + +@example +system("") # flush output +@end example + +@noindent +@code{gawk} treats this use of the @code{system} function as a special +case, and is smart enough not to run a shell (or other command +interpreter) with the empty command. Therefore, with @code{gawk}, this +idiom is not only useful, it is efficient. While this idiom should work +with other @code{awk} implementations, it will not necessarily avoid +starting an unnecessary shell. +@ignore +Need a better explanation, perhaps in a separate paragraph. Explain that +for + +awk 'BEGIN { print "hi" + system("echo hello") + print "howdy" }' + +that the output had better be + + hi + hello + howdy + +and not + + hello + hi + howdy + +which it would be if awk did not flush its buffers before calling system. 
+@end ignore + +@node Time Functions, , I/O Functions, Built-in +@section Functions for Dealing with Time Stamps + +@cindex time stamps +@cindex time of day +A common use for @code{awk} programs is the processing of log files. +Log files often contain time stamp information, indicating when a +particular log record was written. Many programs log their time stamp +in the form returned by the @code{time} system call, which is the +number of seconds since a particular epoch. On @sc{posix} systems, +it is the number of seconds since Midnight, January 1, 1970, @sc{utc}. + +In order to make it easier to process such log files, and to easily produce +useful reports, @code{gawk} provides two functions for working with time +stamps. Both of these are @code{gawk} extensions; they are not specified +in the @sc{posix} standard, nor are they in any other known version +of @code{awk}. + +@table @code +@item systime() +@findex systime +This function returns the current time as the number of seconds since +the system epoch. On @sc{posix} systems, this is the number of seconds +since Midnight, January 1, 1970, @sc{utc}. It may be a different number on +other systems. + +@item strftime(@var{format}, @var{timestamp}) +@findex strftime +This function returns a string. It is similar to the function of the +same name in the @sc{ansi} C standard library. The time specified by +@var{timestamp} is used to produce a string, based on the contents +of the @var{format} string. +@end table + +The @code{systime} function allows you to compare a time stamp from a +log file with the current time of day. In particular, it is easy to +determine how long ago a particular record was logged. It also allows +you to produce log records using the ``seconds since the epoch'' format. + +The @code{strftime} function allows you to easily turn a time stamp +into human-readable information. 
It is similar in nature to the @code{sprintf} +function, copying non-format specification characters verbatim to the +returned string, and substituting date and time values for format +specifications in the @var{format} string. If no @var{timestamp} argument +is supplied, @code{gawk} will use the current time of day as the +time stamp.@refill + +@code{strftime} is guaranteed by the @sc{ansi} C standard to support +the following date format specifications: + +@table @code +@item %a +The locale's abbreviated weekday name. + +@item %A +The locale's full weekday name. + +@item %b +The locale's abbreviated month name. + +@item %B +The locale's full month name. + +@item %c +The locale's ``appropriate'' date and time representation. + +@item %d +The day of the month as a decimal number (01--31). + +@item %H +The hour (24-hour clock) as a decimal number (00--23). + +@item %I +The hour (12-hour clock) as a decimal number (01--12). + +@item %j +The day of the year as a decimal number (001--366). + +@item %m +The month as a decimal number (01--12). + +@item %M +The minute as a decimal number (00--59). + +@item %p +The locale's equivalent of the AM/PM designations associated +with a 12-hour clock. + +@item %S +The second as a decimal number (00--61). (Occasionally there are +minutes in a year with one or two leap seconds, which is why the +seconds can go from 0 all the way to 61.) + +@item %U +The week number of the year (the first Sunday as the first day of week 1) +as a decimal number (00--53). + +@item %w +The weekday as a decimal number (0--6). Sunday is day 0. + +@item %W +The week number of the year (the first Monday as the first day of week 1) +as a decimal number (00--53). + +@item %x +The locale's ``appropriate'' date representation. + +@item %X +The locale's ``appropriate'' time representation. + +@item %y +The year without century as a decimal number (00--99). + +@item %Y +The year with century as a decimal number. 
+ +@item %Z +The time zone name or abbreviation, or no characters if +no time zone is determinable. + +@item %% +A literal @samp{%}. +@end table + +@c The parenthetical remark here should really be a footnote, but +@c it gave formatting problems at the FSF. So for now put it in +@c parentheses. +If a conversion specifier is not one of the above, the behavior is +undefined. (This is because the @sc{ansi} standard for C leaves the +behavior of the C version of @code{strftime} undefined, and @code{gawk} +will use the system's version of @code{strftime} if it's there. +Typically, the conversion specifier will either not appear in the +returned string, or it will appear literally.) + +Informally, a @dfn{locale} is the geographic place in which a program +is meant to run. For example, a common way to abbreviate the date +September 4, 1991 in the United States would be ``9/4/91''. +In many countries in Europe, however, it would be abbreviated ``4.9.91''. +Thus, the @samp{%x} specification in a @code{"US"} locale might produce +@samp{9/4/91}, while in a @code{"EUROPE"} locale, it might produce +@samp{4.9.91}. The @sc{ansi} C standard defines a default @code{"C"} +locale, which is an environment that is typical of what most C programmers +are used to. + +A public-domain C version of @code{strftime} is shipped with @code{gawk} +for systems that are not yet fully @sc{ansi}-compliant. If that version is +used to compile @code{gawk} (@pxref{Installation, ,Installing @code{gawk}}), +then the following additional format specifications are available:@refill + +@table @code +@item %D +Equivalent to specifying @samp{%m/%d/%y}. + +@item %e +The day of the month, padded with a blank if it is only one digit. + +@item %h +Equivalent to @samp{%b}, above. + +@item %n +A newline character (ASCII LF). + +@item %r +Equivalent to specifying @samp{%I:%M:%S %p}. + +@item %R +Equivalent to specifying @samp{%H:%M}. + +@item %T +Equivalent to specifying @samp{%H:%M:%S}. 
+ +@item %t +A TAB character. + +@item %k +is replaced by the hour (24-hour clock) as a decimal number (0-23). +Single digit numbers are padded with a blank. + +@item %l +is replaced by the hour (12-hour clock) as a decimal number (1-12). +Single digit numbers are padded with a blank. + +@item %C +The century, as a number between 00 and 99. + +@item %u +is replaced by the weekday as a decimal number +[1 (Monday)--7]. + +@item %V +is replaced by the week number of the year (the first Monday as the first +day of week 1) as a decimal number (01--53). +The method for determining the week number is as specified by ISO 8601 +(to wit: if the week containing January 1 has four or more days in the +new year, then it is week 1, otherwise it is week 53 of the previous year +and the next week is week 1).@refill + +@item %Ec %EC %Ex %Ey %EY %Od %Oe %OH %OI +@itemx %Om %OM %OS %Ou %OU %OV %Ow %OW %Oy +These are ``alternate representations'' for the specifications +that use only the second letter (@samp{%c}, @samp{%C}, and so on). +They are recognized, but their normal representations are used. +(These facilitate compliance with the @sc{posix} @code{date} +utility.)@refill + +@item %v +The date in VMS format (e.g. 20-JUN-1991). +@end table + +Here are two examples that use @code{strftime}. The first is an +@code{awk} version of the C @code{ctime} function. (This is a +user defined function, which we have not discussed yet. +@xref{User-defined, ,User-defined Functions}, for more information.) + +@smallexample +# ctime.awk +# +# awk version of C ctime(3) function + +function ctime(ts, format) +@{ + format = "%a %b %e %H:%M:%S %Z %Y" + if (ts == 0) + ts = systime() # use current time as default + return strftime(format, ts) +@} +@end smallexample + +This next example is an @code{awk} implementation of the @sc{posix} +@code{date} utility. Normally, the @code{date} utility prints the +current date and time of day in a well known format. 
However, if you +provide an argument to it that begins with a @samp{+}, @code{date} +will copy non-format specifier characters to the standard output, and +will interpret the current time according to the format specifiers in +the string. For example: + +@smallexample +date '+Today is %A, %B %d, %Y.' +@end smallexample + +@noindent +might print + +@smallexample +Today is Thursday, July 11, 1991. +@end smallexample + +Here is the @code{awk} version of the @code{date} utility. + +@smallexample +#! /usr/bin/gawk -f +# +# date --- implement the P1003.2 Draft 11 'date' command +# +# Bug: does not recognize the -u argument. + +BEGIN \ +@{ + format = "%a %b %e %H:%M:%S %Z %Y" + exitval = 0 + + if (ARGC > 2) + exitval = 1 + else if (ARGC == 2) @{ + format = ARGV[1] + if (format ~ /^\+/) + format = substr(format, 2) # remove leading + + @} + print strftime(format) + exit exitval +@} +@end smallexample + +@node User-defined, Built-in Variables, Built-in, Top +@chapter User-defined Functions + +@cindex user-defined functions +@cindex functions, user-defined +Complicated @code{awk} programs can often be simplified by defining +your own functions. User-defined functions can be called just like +built-in ones (@pxref{Function Calls}), but it is up to you to define +them---to tell @code{awk} what they should do. + +@menu +* Definition Syntax:: How to write definitions and what they mean. +* Function Example:: An example function definition and + what it does. +* Function Caveats:: Things to watch out for. +* Return Statement:: Specifying the value a function returns. +@end menu + +@node Definition Syntax, Function Example, User-defined, User-defined +@section Syntax of Function Definitions +@cindex defining functions +@cindex function definition + +Definitions of functions can appear anywhere between the rules of the +@code{awk} program. Thus, the general form of an @code{awk} program is +extended to include sequences of rules @emph{and} user-defined function +definitions. 
+ +The definition of a function named @var{name} looks like this: + +@example +function @var{name} (@var{parameter-list}) @{ + @var{body-of-function} +@} +@end example + +@noindent +@var{name} is the name of the function to be defined. A valid function +name is like a valid variable name: a sequence of letters, digits and +underscores, not starting with a digit. Functions share the same pool +of names as variables and arrays. + +@var{parameter-list} is a list of the function's arguments and local +variable names, separated by commas. When the function is called, +the argument names are used to hold the argument values given in +the call. The local variables are initialized to the null string. + +The @var{body-of-function} consists of @code{awk} statements. It is the +most important part of the definition, because it says what the function +should actually @emph{do}. The argument names exist to give the body a +way to talk about the arguments; local variables, to give the body +places to keep temporary values. + +Argument names are not distinguished syntactically from local variable +names; instead, the number of arguments supplied when the function is +called determines how many argument variables there are. Thus, if three +argument values are given, the first three names in @var{parameter-list} +are arguments, and the rest are local variables. + +It follows that if the number of arguments is not the same in all calls +to the function, some of the names in @var{parameter-list} may be +arguments on some occasions and local variables on others. Another +way to think of this is that omitted arguments default to the +null string. + +Usually when you write a function you know how many names you intend to +use for arguments and how many you intend to use as locals. By +convention, you should write an extra space between the arguments and +the locals, so other people can follow how your function is +supposed to be used. 
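As a minimal sketch of this convention, the following defines a function with two arguments and two locals. The names @code{count_char} and @code{count_demo} are hypothetical and chosen only for illustration; neither is part of @code{awk}. The shell function wrapper is just a convenient way to run the program.

```shell
# Hypothetical demo wrapper; count_char is an illustrative name, not an awk built-in.
count_demo() {
    awk '
    # s and c are arguments; i and n (after the extra spaces) are locals.
    function count_char(s, c,    i, n) {
        n = 0                          # locals start out as the null string
        for (i = 1; i <= length(s); i++)
            if (substr(s, i, 1) == c)
                n++
        return n
    }
    BEGIN {
        i = "outer"                    # global i, shadowed inside count_char
        print count_char("mississippi", "s")
        print i                        # the global i is unchanged
    }'
}
```

Calling @code{count_demo} prints 4 and then @samp{outer}: the loop index @code{i} inside the function is a local, so the assignment made in the @code{BEGIN} rule survives the call.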
+ +During execution of the function body, the arguments and local variable +values hide or @dfn{shadow} any variables of the same names used in the +rest of the program. The shadowed variables are not accessible in the +function definition, because there is no way to name them while their +names have been taken away for the local variables. All other variables +used in the @code{awk} program can be referenced or set normally in the +function definition. + +The arguments and local variables last only as long as the function body +is executing. Once the body finishes, the shadowed variables come back. + +The function body can contain expressions which call functions. They +can even call this function, either directly or by way of another +function. When this happens, we say the function is @dfn{recursive}. + +There is no need in @code{awk} to put the definition of a function +before all uses of the function. This is because @code{awk} reads the +entire program before starting to execute any of it. + +In many @code{awk} implementations, the keyword @code{function} may be +abbreviated @code{func}. However, @sc{posix} only specifies the use of +the keyword @code{function}. This actually has some practical implications. +If @code{gawk} is in @sc{posix}-compatibility mode +(@pxref{Command Line, ,Invoking @code{awk}}), then the following +statement will @emph{not} define a function:@refill + +@example +func foo() @{ a = sqrt($1) ; print a @} +@end example + +@noindent +Instead it defines a rule that, for each record, concatenates the value +of the variable @samp{func} with the return value of the function @samp{foo}, +and based on the truth value of the result, executes the corresponding action. +This is probably not what was desired. (@code{awk} accepts this input as +syntactically valid, since functions may be used before they are defined +in @code{awk} programs.) 
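The points above about recursion, definition order, and the @code{function} keyword can be sketched together in one short program. The names @code{fact} and @code{fact_demo} are hypothetical, used only for illustration. The call appears before the definition, and the keyword is spelled out in full so that the program does not depend on the @code{func} abbreviation.

```shell
# Hypothetical demo wrapper; fact is an illustrative user-defined function.
fact_demo() {
    awk '
    BEGIN { print fact(5) }            # the call precedes the definition

    # Spelled "function", not "func", so POSIX mode accepts it too.
    function fact(n) {
        return (n <= 1) ? 1 : n * fact(n - 1)   # direct recursion
    }'
}
```

Calling @code{fact_demo} prints 120. This works because @code{awk} reads the whole program before executing any of it.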
+ +@node Function Example, Function Caveats, Definition Syntax, User-defined +@section Function Definition Example + +Here is an example of a user-defined function, called @code{myprint}, that +takes a number and prints it in a specific format. + +@example +function myprint(num) +@{ + printf "%6.3g\n", num +@} +@end example + +@noindent +To illustrate, here is an @code{awk} rule which uses our @code{myprint} +function: + +@example +$3 > 0 @{ myprint($3) @} +@end example + +@noindent +This program prints, in our special format, all the third fields that +contain a positive number in our input. Therefore, when given: + +@example + 1.2 3.4 5.6 7.8 + 9.10 11.12 -13.14 15.16 +17.18 19.20 21.22 23.24 +@end example + +@noindent +this program, using our function to format the results, prints: + +@example + 5.6 + 21.2 +@end example + +Here is a rather contrived example of a recursive function. It prints a +string backwards: + +@example +function rev (str, len) @{ + if (len == 0) @{ + printf "\n" + return + @} + printf "%c", substr(str, len, 1) + rev(str, len - 1) +@} +@end example + +@node Function Caveats, Return Statement, Function Example, User-defined +@section Calling User-defined Functions + +@dfn{Calling a function} means causing the function to run and do its job. +A function call is an expression, and its value is the value returned by +the function. + +A function call consists of the function name followed by the arguments +in parentheses. What you write in the call for the arguments are +@code{awk} expressions; each time the call is executed, these +expressions are evaluated, and the values are the actual arguments. For +example, here is a call to @code{foo} with three arguments (the first +being a string concatenation): + +@example +foo(x y, "lose", 4 * z) +@end example + +@quotation +@strong{Caution:} whitespace characters (spaces and tabs) are not allowed +between the function name and the open-parenthesis of the argument list. 
+If you write whitespace by mistake, @code{awk} might think that you mean +to concatenate a variable with an expression in parentheses. However, it +notices that you used a function name and not a variable name, and reports +an error. +@end quotation + +@cindex call by value +When a function is called, it is given a @emph{copy} of the values of +its arguments. This is called @dfn{call by value}. The caller may use +a variable as the expression for the argument, but the called function +does not know this: it only knows what value the argument had. For +example, if you write this code: + +@example +foo = "bar" +z = myfunc(foo) +@end example + +@noindent +then you should not think of the argument to @code{myfunc} as being +``the variable @code{foo}.'' Instead, think of the argument as the +string value, @code{"bar"}. + +If the function @code{myfunc} alters the values of its local variables, +this has no effect on any other variables. In particular, if @code{myfunc} +does this: + +@example +function myfunc (win) @{ + print win + win = "zzz" + print win +@} +@end example + +@noindent +to change its first argument variable @code{win}, this @emph{does not} +change the value of @code{foo} in the caller. The role of @code{foo} in +calling @code{myfunc} ended when its value, @code{"bar"}, was computed. +If @code{win} also exists outside of @code{myfunc}, the function body +cannot alter this outer value, because it is shadowed during the +execution of @code{myfunc} and cannot be seen or changed from there. + +@cindex call by reference +However, when arrays are the parameters to functions, they are @emph{not} +copied. Instead, the array itself is made available for direct manipulation +by the function. This is usually called @dfn{call by reference}. +Changes made to an array parameter inside the body of a function @emph{are} +visible outside that function. +@ifinfo +This can be @strong{very} dangerous if you do not watch what you are +doing. 
For example:@refill +@end ifinfo +@iftex +@emph{This can be very dangerous if you do not watch what you are +doing.} For example:@refill +@end iftex + +@example +function changeit (array, ind, nvalue) @{ + array[ind] = nvalue +@} + +BEGIN @{ + a[1] = 1 ; a[2] = 2 ; a[3] = 3 + changeit(a, 2, "two") + printf "a[1] = %s, a[2] = %s, a[3] = %s\n", a[1], a[2], a[3] + @} +@end example + +@noindent +prints @samp{a[1] = 1, a[2] = two, a[3] = 3}, because calling +@code{changeit} stores @code{"two"} in the second element of @code{a}. + +@node Return Statement, , Function Caveats, User-defined +@section The @code{return} Statement +@cindex @code{return} statement + +The body of a user-defined function can contain a @code{return} statement. +This statement returns control to the rest of the @code{awk} program. It +can also be used to return a value for use in the rest of the @code{awk} +program. It looks like this:@refill + +@example +return @var{expression} +@end example + +The @var{expression} part is optional. If it is omitted, then the returned +value is undefined and, therefore, unpredictable. + +A @code{return} statement with no value expression is assumed at the end of +every function definition. So if control reaches the end of the function +body, then the function returns an unpredictable value. @code{awk} +will not warn you if you use the return value of such a function; you will +simply get unpredictable or unexpected results. + +Here is an example of a user-defined function that returns a value +for the largest number among the elements of an array:@refill + +@example +@group +function maxelt (vec, i, ret) @{ + for (i in vec) @{ + if (ret == "" || vec[i] > ret) + ret = vec[i] + @} + return ret +@} +@end group +@end example + +@noindent +You call @code{maxelt} with one argument, which is an array name. 
The local +variables @code{i} and @code{ret} are not intended to be arguments; +while there is nothing to stop you from passing two or three arguments +to @code{maxelt}, the results would be strange. The extra space before +@code{i} in the function parameter list is to indicate that @code{i} and +@code{ret} are not supposed to be arguments. This is a convention which +you should follow when you define functions. + +Here is a program that uses our @code{maxelt} function. It loads an +array, calls @code{maxelt}, and then reports the maximum number in that +array:@refill + +@example +@group +awk ' +function maxelt (vec, i, ret) @{ + for (i in vec) @{ + if (ret == "" || vec[i] > ret) + ret = vec[i] + @} + return ret +@} +@end group + +@group +# Load all fields of each record into nums. +@{ + for(i = 1; i <= NF; i++) + nums[NR, i] = $i +@} + +END @{ + print maxelt(nums) +@}' +@end group +@end example + +Given the following input: + +@example +@group + 1 5 23 8 16 +44 3 5 2 8 26 +256 291 1396 2962 100 +-6 467 998 1101 +99385 11 0 225 +@end group +@end example + +@noindent +our program tells us (predictably) that: + +@example +99385 +@end example + +@noindent +is the largest number in our array. + +@node Built-in Variables, Command Line, User-defined, Top +@chapter Built-in Variables +@cindex built-in variables + +Most @code{awk} variables are available for you to use for your own +purposes; they never change except when your program assigns values to +them, and never affect anything except when your program examines them. + +A few variables have special built-in meanings. Some of them @code{awk} +examines automatically, so that they enable you to tell @code{awk} how +to do certain things. Others are set automatically by @code{awk}, so +that they carry information from the internal workings of @code{awk} to +your program. + +This chapter documents all the built-in variables of @code{gawk}. 
Most +of them are also documented in the chapters where their areas of +activity are described. + +@menu +* User-modified:: Built-in variables that you change + to control @code{awk}. +* Auto-set:: Built-in variables where @code{awk} + gives you information. +@end menu + +@node User-modified, Auto-set, Built-in Variables, Built-in Variables +@section Built-in Variables that Control @code{awk} +@cindex built-in variables, user modifiable + +This is a list of the variables which you can change to control how +@code{awk} does certain things. + +@table @code +@iftex +@vindex CONVFMT +@end iftex +@item CONVFMT +This string is used by @code{awk} to control conversion of numbers to +strings (@pxref{Conversion, ,Conversion of Strings and Numbers}). +It works by being passed, in effect, as the first argument to the +@code{sprintf} function. Its default value is @code{"%.6g"}. +@code{CONVFMT} was introduced by the @sc{posix} standard.@refill + +@iftex +@vindex FIELDWIDTHS +@end iftex +@item FIELDWIDTHS +This is a space separated list of columns that tells @code{gawk} +how to manage input with fixed, columnar boundaries. It is an +experimental feature that is still evolving. Assigning to @code{FIELDWIDTHS} +overrides the use of @code{FS} for field splitting. +@xref{Constant Size, ,Reading Fixed-width Data}, for more information.@refill + +If @code{gawk} is in compatibility mode +(@pxref{Command Line, ,Invoking @code{awk}}), then @code{FIELDWIDTHS} +has no special meaning, and field splitting operations are done based +exclusively on the value of @code{FS}.@refill + +@iftex +@vindex FS +@end iftex +@item FS +@code{FS} is the input field separator +(@pxref{Field Separators, ,Specifying how Fields are Separated}). +The value is a single-character string or a multi-character regular +expression that matches the separations between fields in an input +record.@refill + +The default value is @w{@code{" "}}, a string consisting of a single +space. 
As a special exception, this value actually means that any +sequence of spaces and tabs is a single separator. It also causes +spaces and tabs at the beginning or end of a line to be ignored. + +You can set the value of @code{FS} on the command line using the +@samp{-F} option: + +@example +awk -F, '@var{program}' @var{input-files} +@end example + +If @code{gawk} is using @code{FIELDWIDTHS} for field-splitting, +assigning a value to @code{FS} will cause @code{gawk} to return to +the normal, regexp-based, field splitting. + +@item IGNORECASE +@iftex +@vindex IGNORECASE +@end iftex +If @code{IGNORECASE} is nonzero, then @emph{all} regular expression +matching is done in a case-independent fashion. In particular, regexp +matching with @samp{~} and @samp{!~}, and the @code{gsub}, @code{index}, +@code{match}, @code{split} and @code{sub} functions all ignore case when +doing their particular regexp operations. @strong{Note:} since field +splitting with the value of the @code{FS} variable is also a regular +expression operation, that too is done with case ignored. +@xref{Case-sensitivity, ,Case-sensitivity in Matching}. + +If @code{gawk} is in compatibility mode +(@pxref{Command Line, ,Invoking @code{awk}}), then @code{IGNORECASE} has +no special meaning, and regexp operations are always case-sensitive.@refill + +@item OFMT +@iftex +@vindex OFMT +@end iftex +This string is used by @code{awk} to control conversion of numbers to +strings (@pxref{Conversion, ,Conversion of Strings and Numbers}) for +printing with the @code{print} statement. +It works by being passed, in effect, as the first argument to the +@code{sprintf} function. Its default value is @code{"%.6g"}. +Earlier versions of @code{awk} also used @code{OFMT} to specify the +format for converting numbers to strings in general expressions; this +has been taken over by @code{CONVFMT}.@refill + +@item OFS +@iftex +@vindex OFS +@end iftex +This is the output field separator (@pxref{Output Separators}).
It is +output between the fields output by a @code{print} statement. Its +default value is @w{@code{" "}}, a string consisting of a single space. + +@item ORS +@iftex +@vindex ORS +@end iftex +This is the output record separator. It is output at the end of every +@code{print} statement. Its default value is a string containing a +single newline character, which could be written as @code{"\n"}. +(@xref{Output Separators}.)@refill + +@item RS +@iftex +@vindex RS +@end iftex +This is @code{awk}'s input record separator. Its default value is a string +containing a single newline character, which means that an input record +consists of a single line of text. +(@xref{Records, ,How Input is Split into Records}.)@refill + +@item SUBSEP +@iftex +@vindex SUBSEP +@end iftex +@code{SUBSEP} is the subscript separator. It has the default value of +@code{"\034"}, and is used to separate the parts of the name of a +multi-dimensional array. Thus, if you access @code{foo[12,3]}, it +really accesses @code{foo["12\0343"]} +(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).@refill +@end table + +@node Auto-set, , User-modified, Built-in Variables +@section Built-in Variables that Convey Information + +This is a list of the variables that are set automatically by @code{awk} +on certain occasions so as to provide information to your program. + +@table @code +@item ARGC +@itemx ARGV +@iftex +@vindex ARGC +@vindex ARGV +@end iftex +The command-line arguments available to @code{awk} programs are stored in +an array called @code{ARGV}. @code{ARGC} is the number of command-line +arguments present. @xref{Command Line, ,Invoking @code{awk}}. +@code{ARGV} is indexed from zero to @w{@code{ARGC - 1}}. 
For example:@refill + +@example +awk 'BEGIN @{ + for (i = 0; i < ARGC; i++) + print ARGV[i] + @}' inventory-shipped BBS-list +@end example + +@noindent +In this example, @code{ARGV[0]} contains @code{"awk"}, @code{ARGV[1]} +contains @code{"inventory-shipped"}, and @code{ARGV[2]} contains +@code{"BBS-list"}. The value of @code{ARGC} is 3, one more than the +index of the last element in @code{ARGV} since the elements are numbered +from zero.@refill + +The names @code{ARGC} and @code{ARGV}, as well as the convention of indexing +the array from 0 to @w{@code{ARGC - 1}}, are derived from the C language's +method of accessing command line arguments.@refill + +Notice that the @code{awk} program is not entered in @code{ARGV}. The +other special command line options, with their arguments, are also not +entered. But variable assignments on the command line @emph{are} +treated as arguments, and do show up in the @code{ARGV} array. + +Your program can alter @code{ARGC} and the elements of @code{ARGV}. +Each time @code{awk} reaches the end of an input file, it uses the next +element of @code{ARGV} as the name of the next input file. By storing a +different string there, your program can change which files are read. +You can use @code{"-"} to represent the standard input. By storing +additional elements and incrementing @code{ARGC} you can cause +additional files to be read. + +If you decrease the value of @code{ARGC}, that eliminates input files +from the end of the list. By recording the old value of @code{ARGC} +elsewhere, your program can treat the eliminated arguments as +something other than file names. + +To eliminate a file from the middle of the list, store the null string +(@code{""}) into @code{ARGV} in place of the file's name. As a +special feature, @code{awk} ignores file names that have been +replaced with the null string. + +@ignore +see getopt.awk in the examples...
+@end ignore + +@item ARGIND +@vindex ARGIND +The index in @code{ARGV} of the current file being processed. +Every time @code{gawk} opens a new data file for processing, it sets +@code{ARGIND} to the index in @code{ARGV} of the file name. Thus, the +condition @samp{FILENAME == ARGV[ARGIND]} is always true. + +This variable is useful in file processing; it allows you to tell how far +along you are in the list of data files, and to distinguish between +multiple successive instances of the same filename on the command line. + +While you can change the value of @code{ARGIND} within your @code{awk} +program, @code{gawk} will automatically set it to a new value when the +next file is opened. + +This variable is a @code{gawk} extension; in other @code{awk} implementations +it is not special. + +@item ENVIRON +@vindex ENVIRON +This is an array that contains the values of the environment. The array +indices are the environment variable names; the values are the values of +the particular environment variables. For example, +@code{ENVIRON["HOME"]} might be @file{/u/close}. Changing this array +does not affect the environment passed on to any programs that +@code{awk} may spawn via redirection or the @code{system} function. +(In a future version of @code{gawk}, it may do so.) + +Some operating systems may not have environment variables. +On such systems, the array @code{ENVIRON} is empty. + +@item ERRNO +@iftex +@vindex ERRNO +@end iftex +If a system error occurs either doing a redirection for @code{getline}, +during a read for @code{getline}, or during a @code{close} operation, +then @code{ERRNO} will contain a string describing the error. + +This variable is a @code{gawk} extension; in other @code{awk} implementations +it is not special. + +@item FILENAME +@iftex +@vindex FILENAME +@end iftex +This is the name of the file that @code{awk} is currently reading. 
+If @code{awk} is reading from the standard input (in other words, +there are no files listed on the command line), +@code{FILENAME} is set to @code{"-"}. +@code{FILENAME} is changed each time a new file is read +(@pxref{Reading Files, ,Reading Input Files}).@refill + +@item FNR +@iftex +@vindex FNR +@end iftex +@code{FNR} is the current record number in the current file. @code{FNR} is +incremented each time a new record is read +(@pxref{Getline, ,Explicit Input with @code{getline}}). It is reinitialized +to 0 each time a new input file is started.@refill + +@item NF +@iftex +@vindex NF +@end iftex +@code{NF} is the number of fields in the current input record. +@code{NF} is set each time a new record is read, when a new field is +created, or when @code{$0} changes (@pxref{Fields, ,Examining Fields}).@refill + +@item NR +@iftex +@vindex NR +@end iftex +This is the number of input records @code{awk} has processed since +the beginning of the program's execution. +(@pxref{Records, ,How Input is Split into Records}). +@code{NR} is set each time a new record is read.@refill + +@item RLENGTH +@iftex +@vindex RLENGTH +@end iftex +@code{RLENGTH} is the length of the substring matched by the +@code{match} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +@code{RLENGTH} is set by invoking the @code{match} function. Its value +is the length of the matched string, or @minus{}1 if no match was found.@refill + +@item RSTART +@iftex +@vindex RSTART +@end iftex +@code{RSTART} is the start-index in characters of the substring matched by the +@code{match} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). +@code{RSTART} is set by invoking the @code{match} function. 
Its value +is the position of the string where the matched substring starts, or 0 +if no match was found.@refill +@end table + +@node Command Line, Language History, Built-in Variables, Top +@c node-name, next, previous, up +@chapter Invoking @code{awk} +@cindex command line +@cindex invocation of @code{gawk} +@cindex arguments, command line +@cindex options, command line +@cindex long options +@cindex options, long + +There are two ways to run @code{awk}: with an explicit program, or with +one or more program files. Here are templates for both of them; items +enclosed in @samp{@r{[}@dots{}@r{]}} in these templates are optional. + +Besides traditional one-letter @sc{posix}-style options, @code{gawk} also +supports GNU long named options. + +@example +awk @r{[@var{POSIX or GNU style options}]} -f progfile @r{[@code{--}]} @var{file} @dots{} +awk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{} +@end example + +@menu +* Options:: Command line options and their meanings. +* Other Arguments:: Input file names and variable assignments. +* AWKPATH Variable:: Searching directories for @code{awk} programs. +* Obsolete:: Obsolete Options and/or features. +* Undocumented:: Undocumented Options and Features. +@end menu + +@node Options, Other Arguments, Command Line, Command Line +@section Command Line Options + +Options begin with a minus sign, and consist of a single character. +GNU style long named options consist of two minus signs and +a keyword that can be abbreviated if the abbreviation allows the option +to be uniquely identified. If the option takes an argument, then the +keyword is immediately followed by an equals sign (@samp{=}) and the +argument's value. For brevity, the discussion below only refers to the +traditional short options; however the long and short options are +interchangeable in all contexts. + +Each long named option for @code{gawk} has a corresponding +@sc{posix}-style option. 
The options and their meanings are as follows: + +@table @code +@item -F @var{fs} +@itemx --field-separator=@var{fs} +@iftex +@cindex @code{-F} option +@end iftex +@cindex @code{--field-separator} option +Sets the @code{FS} variable to @var{fs} +(@pxref{Field Separators, ,Specifying how Fields are Separated}).@refill + +@item -f @var{source-file} +@itemx --file=@var{source-file} +@iftex +@cindex @code{-f} option +@end iftex +@cindex @code{--file} option +Indicates that the @code{awk} program is to be found in @var{source-file} +instead of in the first non-option argument. + +@item -v @var{var}=@var{val} +@itemx --assign=@var{var}=@var{val} +@cindex @samp{-v} option +@cindex @code{--assign} option +Sets the variable @var{var} to the value @var{val} @emph{before} +execution of the program begins. Such variable values are available +inside the @code{BEGIN} rule (see below for a fuller explanation). + +The @samp{-v} option can only set one variable, but you can use +it more than once, setting another variable each time, like this: +@samp{@w{-v foo=1} @w{-v bar=2}}. + +@item -W @var{gawk-opt} +@cindex @samp{-W} option +Following the @sc{posix} standard, options that are implementation +specific are supplied as arguments to the @samp{-W} option. With @code{gawk}, +these arguments may be separated by commas, or quoted and separated by +whitespace. Case is ignored when processing these options. These options +also have corresponding GNU style long named options. The following +@code{gawk}-specific options are available: + +@table @code +@item -W compat +@itemx --compat +@cindex @code{--compat} option +Specifies @dfn{compatibility mode}, in which the GNU extensions in +@code{gawk} are disabled, so that @code{gawk} behaves just like Unix +@code{awk}. +@xref{POSIX/GNU, ,Extensions in @code{gawk} not in POSIX @code{awk}}, +which summarizes the extensions. 
Also see +@ref{Compatibility Mode, ,Downward Compatibility and Debugging}.@refill + +@item -W copyleft +@itemx -W copyright +@itemx --copyleft +@itemx --copyright +@cindex @code{--copyleft} option +@cindex @code{--copyright} option +Print the short version of the General Public License. +This option may disappear in a future version of @code{gawk}. + +@item -W help +@itemx -W usage +@itemx --help +@itemx --usage +@cindex @code{--help} option +@cindex @code{--usage} option +Print a ``usage'' message summarizing the short and long style options +that @code{gawk} accepts, and then exit. + +@item -W lint +@itemx --lint +@cindex @code{--lint} option +Provide warnings about constructs that are dubious or non-portable to +other @code{awk} implementations. +Some warnings are issued when @code{gawk} first reads your program. Others +are issued at run-time, as your program executes. + +@item -W posix +@itemx --posix +@cindex @code{--posix} option +Operate in strict @sc{posix} mode. This disables all @code{gawk} +extensions (just like @code{-W compat}), and adds the following additional +restrictions: + +@itemize @bullet{} +@item +@code{\x} escape sequences are not recognized +(@pxref{Constants, ,Constant Expressions}).@refill + +@item +The synonym @code{func} for the keyword @code{function} is not +recognized (@pxref{Definition Syntax, ,Syntax of Function Definitions}). + +@item +The operators @samp{**} and @samp{**=} cannot be used in +place of @samp{^} and @samp{^=} (@pxref{Arithmetic Ops, ,Arithmetic Operators}, +and also @pxref{Assignment Ops, ,Assignment Expressions}).@refill + +@item +Specifying @samp{-Ft} on the command line does not set the value +of @code{FS} to be a single tab character +(@pxref{Field Separators, ,Specifying how Fields are Separated}).@refill +@end itemize + +Although you can supply both @samp{-W compat} and @samp{-W posix} on the +command line, @samp{-W posix} will take precedence. 
+ +@item -W source=@var{program-text} +@itemx --source=@var{program-text} +@cindex @code{--source} option +Program source code is taken from the @var{program-text}. This option +allows you to mix @code{awk} source code in files with program source +code that you would enter on the command line. This is particularly useful +when you have library functions that you wish to use from your command line +programs (@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}). + +@item -W version +@itemx --version +@cindex @code{--version} option +Prints version information for this particular copy of @code{gawk}. +This is so you can determine if your copy of @code{gawk} is up to date +with respect to whatever the Free Software Foundation is currently +distributing. This option may disappear in a future version of @code{gawk}. +@end table + +@item -- +Signals the end of the command line options. The following arguments +are not treated as options even if they begin with @samp{-}. This +interpretation of @samp{--} follows the @sc{posix} argument parsing +conventions. + +This is useful if you have file names that start with @samp{-}, +or in shell scripts, if you have file names that will be specified +by the user which could start with @samp{-}. +@end table + +Any other options are flagged as invalid with a warning message, but +are otherwise ignored. + +In compatibility mode, as a special case, if the value of @var{fs} supplied +to the @samp{-F} option is @samp{t}, then @code{FS} is set to the tab +character (@code{"\t"}). This is only true for @samp{-W compat}, and not +for @samp{-W posix} +(@pxref{Field Separators, ,Specifying how Fields are Separated}).@refill + +If the @samp{-f} option is @emph{not} used, then the first non-option +command line argument is expected to be the program text. + +The @samp{-f} option may be used more than once on the command line. 
+If it is, @code{awk} reads its program source from all of the named files, as +if they had been concatenated together into one big file. This is +useful for creating libraries of @code{awk} functions. Useful functions +can be written once, and then retrieved from a standard place, instead +of having to be included into each individual program. You can still +type in a program at the terminal and use library functions, by specifying +@samp{-f /dev/tty}. @code{awk} will read a file from the terminal +to use as part of the @code{awk} program. After typing your program, +type @kbd{Control-d} (the end-of-file character) to terminate it. +(You may also use @samp{-f -} to read program source from the standard +input, but then you will not be able to also use the standard input as a +source of data.) + +Because it is clumsy using the standard @code{awk} mechanisms to mix source +file and command line @code{awk} programs, @code{gawk} provides the +@samp{--source} option. This does not require you to pre-empt the standard +input for your source code, and allows you to easily mix command line +and library source code +(@pxref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}). + +If no @samp{-f} or @samp{--source} option is specified, then @code{gawk} +will use the first non-option command line argument as the text of the +program source code. + +@node Other Arguments, AWKPATH Variable, Options, Command Line +@section Other Command Line Arguments + +Any additional arguments on the command line are normally treated as +input files to be processed in the order specified. However, an +argument that has the form @code{@var{var}=@var{value}}, means to assign +the value @var{value} to the variable @var{var}---it does not specify a +file at all. + +@vindex ARGV +All these arguments are made available to your @code{awk} program in the +@code{ARGV} array (@pxref{Built-in Variables}). 
Command line options +and the program text (if present) are omitted from the @code{ARGV} +array. All other arguments, including variable assignments, are +included. + +The distinction between file name arguments and variable-assignment +arguments is made when @code{awk} is about to open the next input file. +At that point in execution, it checks the ``file name'' to see whether +it is really a variable assignment; if so, @code{awk} sets the variable +instead of reading a file. + +Therefore, the variables actually receive the specified values after all +previously specified files have been read. In particular, the values of +variables assigned in this fashion are @emph{not} available inside a +@code{BEGIN} rule +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}), +since such rules are run before @code{awk} begins scanning the argument list. +The values given on the command line are processed for escape sequences +(@pxref{Constants, ,Constant Expressions}).@refill + +In some earlier implementations of @code{awk}, when a variable assignment +occurred before any file names, the assignment would happen @emph{before} +the @code{BEGIN} rule was executed. Some applications came to depend +upon this ``feature.'' When @code{awk} was changed to be more consistent, +the @samp{-v} option was added to accommodate applications that depended +upon this old behavior. + +The variable assignment feature is most useful for assigning to variables +such as @code{RS}, @code{OFS}, and @code{ORS}, which control input and +output formats, before scanning the data files. It is also useful for +controlling state if multiple passes are needed over a data file. For +example:@refill + +@cindex multiple passes over data +@cindex passes, multiple +@smallexample +awk 'pass == 1 @{ @var{pass 1 stuff} @} + pass == 2 @{ @var{pass 2 stuff} @}' pass=1 datafile pass=2 datafile +@end smallexample + +Given the variable assignment feature, the @samp{-F} option is not +strictly necessary. 
It remains for historical compatibility. + +@node AWKPATH Variable, Obsolete, Other Arguments, Command Line +@section The @code{AWKPATH} Environment Variable +@cindex @code{AWKPATH} environment variable +@cindex search path +@cindex directory search +@cindex path, search +@iftex +@cindex differences between @code{gawk} and @code{awk} +@end iftex + +The previous section described how @code{awk} program files can be named +on the command line with the @samp{-f} option. In some @code{awk} +implementations, you must supply a precise path name for each program +file, unless the file is in the current directory. + +But in @code{gawk}, if the file name supplied in the @samp{-f} option +does not contain a @samp{/}, then @code{gawk} searches a list of +directories (called the @dfn{search path}), one by one, looking for a +file with the specified name. + +The search path is actually a string consisting of directory names +separated by colons. @code{gawk} gets its search path from the +@code{AWKPATH} environment variable. If that variable does not exist, +@code{gawk} uses the default path, which is +@samp{.:/usr/lib/awk:/usr/local/lib/awk}. (Programs written by +system administrators should use an @code{AWKPATH} variable that +does not include the current directory, @samp{.}.)@refill + +The search path feature is particularly useful for building up libraries +of useful @code{awk} functions. The library files can be placed in a +standard directory that is in the default path, and then specified on +the command line with a short file name. Otherwise, the full file name +would have to be typed for each file. + +By combining the @samp{--source} and @samp{-f} options, your command line +@code{awk} programs can use facilities in @code{awk} library files. + +Path searching is not done if @code{gawk} is in compatibility mode. +This is true for both @samp{-W compat} and @samp{-W posix}. +@xref{Options, ,Command Line Options}. 
+ +@strong{Note:} if you want files in the current directory to be found, +you must include the current directory in the path, either by writing +@file{.} as an entry in the path, or by writing a null entry in the +path. (A null entry is indicated by starting or ending the path with a +colon, or by placing two colons next to each other (@samp{::}).) If the +current directory is not included in the path, then files cannot be +found in the current directory. This path search mechanism is identical +to the shell's. +@c someday, @cite{The Bourne Again Shell}.... + +@node Obsolete, Undocumented, AWKPATH Variable, Command Line +@section Obsolete Options and/or Features + +@cindex deprecated options +@cindex obsolete options +@cindex deprecated features +@cindex obsolete features +This section describes features and/or command line options from the +previous release of @code{gawk} that are either not available in the +current version, or that are still supported but deprecated (meaning that +they will @emph{not} be in the next release). + +@c update this section for each release! + +For version 2.15 of @code{gawk}, the following command line options +from version 2.11.1 are no longer recognized. + +@table @samp +@ignore +@item -nostalgia +Use @samp{-W nostalgia} instead. +@end ignore + +@item -c +Use @samp{-W compat} instead. + +@item -V +Use @samp{-W version} instead. + +@item -C +Use @samp{-W copyright} instead. + +@item -a +@itemx -e +These options produce an ``unrecognized option'' error message but have +no effect on the execution of @code{gawk}. The @sc{posix} standard now +specifies traditional @code{awk} regular expressions for the @code{awk} utility. +@end table + +The public-domain version of @code{strftime} that is distributed with +@code{gawk} changed for the 2.14 release. The @samp{%V} conversion specifier +that used to generate the date in VMS format was changed to @samp{%v}. 
+This is because the @sc{posix} standard for the @code{date} utility now +specifies a @samp{%V} conversion specifier. +@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for details. + +@node Undocumented, , Obsolete, Command Line +@section Undocumented Options and Features + +This section intentionally left blank. + +@c Read The Source, Luke! + +@ignore +@c If these came out in the Info file or TeX manual, then they wouldn't +@c be undocumented, would they? + +@code{gawk} has one undocumented option: + +@table @samp +@item -W nostalgia +Print the message @code{"awk: bailing out near line 1"} and dump core. +This option was inspired by the common behavior of very early versions of +Unix @code{awk}, and by a t--shirt. +@end table + +Early versions of @code{awk} used to not require any separator (either +a newline or @samp{;}) between the rules in @code{awk} programs. Thus, +it was common to see one-line programs like: + +@example +awk '@{ sum += $1 @} END @{ print sum @}' +@end example + +@code{gawk} actually supports this, but it is purposely undocumented +since it is considered bad style. The correct way to write such a program +is either + +@example +awk '@{ sum += $1 @} ; END @{ print sum @}' +@end example + +@noindent +or + +@example +awk '@{ sum += $1 @} + END @{ print sum @}' data +@end example + +@noindent +@xref{Statements/Lines, ,@code{awk} Statements versus Lines}, for a fuller +explanation.@refill + +As an accident of the implementation of the original Unix @code{awk}, if +a built-in function used @code{$0} as its default argument, it was possible +to call that function without the parentheses. In particular, it was +common practice to use the @code{length} function in this fashion. +For example, the pipeline: + +@example +echo abcdef | awk '@{ print length @}' +@end example + +@noindent +would print @samp{6}. + +For backwards compatibility with old programs, @code{gawk} supports +this usage, but only for the @code{length} function. 
New programs should +@emph{not} call the @code{length} function this way. In particular, +this usage will not be portable to other @sc{posix} compliant versions +of @code{awk}. It is also poor style. + +@end ignore + +@node Language History, Installation, Command Line, Top +@chapter The Evolution of the @code{awk} Language + +This manual describes the GNU implementation of @code{awk}, which is patterned +after the @sc{posix} specification. Many @code{awk} users are only familiar +with the original @code{awk} implementation in Version 7 Unix, which is also +the basis for the version in Berkeley Unix (through 4.3--Reno). This chapter +briefly describes the evolution of the @code{awk} language. + +@menu +* V7/S5R3.1:: The major changes between V7 and + System V Release 3.1. +* S5R4:: Minor changes between System V + Releases 3.1 and 4. +* POSIX:: New features from the @sc{posix} standard. +* POSIX/GNU:: The extensions in @code{gawk} + not in @sc{posix} @code{awk}. +@end menu + +@node V7/S5R3.1, S5R4, Language History, Language History +@section Major Changes between V7 and S5R3.1 + +The @code{awk} language evolved considerably between the release of +Version 7 Unix (1978) and the new version first made widely available in +System V Release 3.1 (1987). This section summarizes the changes, with +cross-references to further details. + +@itemize @bullet +@item +The requirement for @samp{;} to separate rules on a line +(@pxref{Statements/Lines, ,@code{awk} Statements versus Lines}). + +@item +User-defined functions, and the @code{return} statement +(@pxref{User-defined, ,User-defined Functions}). + +@item +The @code{delete} statement (@pxref{Delete, ,The @code{delete} Statement}). + +@item +The @code{do}-@code{while} statement +(@pxref{Do Statement, ,The @code{do}-@code{while} Statement}).@refill + +@item +The built-in functions @code{atan2}, @code{cos}, @code{sin}, @code{rand} and +@code{srand} (@pxref{Numeric Functions, ,Numeric Built-in Functions}). 
+ +@item +The built-in functions @code{gsub}, @code{sub}, and @code{match} +(@pxref{String Functions, ,Built-in Functions for String Manipulation}). + +@item +The built-in functions @code{close}, which closes an open file, and +@code{system}, which allows the user to execute operating system +commands (@pxref{I/O Functions, ,Built-in Functions for Input/Output}).@refill +@c Does the above verbiage prevents an overfull hbox? --mew, rjc 24jan1992 + +@item +The @code{ARGC}, @code{ARGV}, @code{FNR}, @code{RLENGTH}, @code{RSTART}, +and @code{SUBSEP} built-in variables (@pxref{Built-in Variables}). + +@item +The conditional expression using the operators @samp{?} and @samp{:} +(@pxref{Conditional Exp, ,Conditional Expressions}).@refill + +@item +The exponentiation operator @samp{^} +(@pxref{Arithmetic Ops, ,Arithmetic Operators}) and its assignment operator +form @samp{^=} (@pxref{Assignment Ops, ,Assignment Expressions}).@refill + +@item +C-compatible operator precedence, which breaks some old @code{awk} +programs (@pxref{Precedence, ,Operator Precedence (How Operators Nest)}). + +@item +Regexps as the value of @code{FS} +(@pxref{Field Separators, ,Specifying how Fields are Separated}), and as the +third argument to the @code{split} function +(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@refill + +@item +Dynamic regexps as operands of the @samp{~} and @samp{!~} operators +(@pxref{Regexp Usage, ,How to Use Regular Expressions}). + +@item +Escape sequences (@pxref{Constants, ,Constant Expressions}) in regexps.@refill + +@item +The escape sequences @samp{\b}, @samp{\f}, and @samp{\r} +(@pxref{Constants, ,Constant Expressions}). 
+ +@item +Redirection of input for the @code{getline} function +(@pxref{Getline, ,Explicit Input with @code{getline}}).@refill + +@item +Multiple @code{BEGIN} and @code{END} rules +(@pxref{BEGIN/END, ,@code{BEGIN} and @code{END} Special Patterns}).@refill + +@item +Simulated multi-dimensional arrays +(@pxref{Multi-dimensional, ,Multi-dimensional Arrays}).@refill +@end itemize + +@node S5R4, POSIX, V7/S5R3.1, Language History +@section Changes between S5R3.1 and S5R4 + +The System V Release 4 version of Unix @code{awk} added these features +(some of which originated in @code{gawk}): + +@itemize @bullet +@item +The @code{ENVIRON} variable (@pxref{Built-in Variables}). + +@item +Multiple @samp{-f} options on the command line +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@item +The @samp{-v} option for assigning variables before program execution begins +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@item +The @samp{--} option for terminating command line options. + +@item +The @samp{\a}, @samp{\v}, and @samp{\x} escape sequences +(@pxref{Constants, ,Constant Expressions}).@refill + +@item +A defined return value for the @code{srand} built-in function +(@pxref{Numeric Functions, ,Numeric Built-in Functions}). 
+ +@item +The @code{toupper} and @code{tolower} built-in string functions +for case translation +(@pxref{String Functions, ,Built-in Functions for String Manipulation}).@refill + +@item +A cleaner specification for the @samp{%c} format-control letter in the +@code{printf} function +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).@refill + +@item +The ability to dynamically pass the field width and precision (@code{"%*.*d"}) +in the argument list of the @code{printf} function +(@pxref{Printf, ,Using @code{printf} Statements for Fancier Printing}).@refill + +@item +The use of constant regexps such as @code{/foo/} as expressions, where +they are equivalent to use of the matching operator, as in @code{$0 ~ +/foo/} (@pxref{Constants, ,Constant Expressions}). +@end itemize + +@node POSIX, POSIX/GNU, S5R4, Language History +@section Changes between S5R4 and POSIX @code{awk} + +The @sc{posix} Command Language and Utilities standard for @code{awk} +introduced the following changes into the language: + +@itemize @bullet{} +@item +The use of @samp{-W} for implementation-specific options. + +@item +The use of @code{CONVFMT} for controlling the conversion of numbers +to strings (@pxref{Conversion, ,Conversion of Strings and Numbers}). + +@item +The concept of a numeric string, and tighter comparison rules to go +with it (@pxref{Comparison Ops, ,Comparison Expressions}). + +@item +More complete documentation of many of the previously undocumented +features of the language. 
+@end itemize + +@node POSIX/GNU, , POSIX, Language History +@section Extensions in @code{gawk} not in POSIX @code{awk} + +The GNU implementation, @code{gawk}, adds these features: + +@itemize @bullet +@item +The @code{AWKPATH} environment variable for specifying a path search for +the @samp{-f} command line option +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@item +The various @code{gawk}-specific features available via the @samp{-W} +command line option (@pxref{Command Line, ,Invoking @code{awk}}). + +@item +The @code{ARGIND} variable, which tracks the movement of @code{FILENAME} +through @code{ARGV} (@pxref{Built-in Variables}). + +@item +The @code{ERRNO} variable, which contains the system error message when +@code{getline} returns @minus{}1, or when @code{close} fails +(@pxref{Built-in Variables}). + +@item +The @code{IGNORECASE} variable and its effects +(@pxref{Case-sensitivity, ,Case-sensitivity in Matching}).@refill + +@item +The @code{FIELDWIDTHS} variable and its effects +(@pxref{Constant Size, ,Reading Fixed-width Data}).@refill + +@item +The @code{next file} statement for skipping to the next data file +(@pxref{Next File Statement, ,The @code{next file} Statement}).@refill + +@item +The @code{systime} and @code{strftime} built-in functions for obtaining +and printing time stamps +(@pxref{Time Functions, ,Functions for Dealing with Time Stamps}).@refill + +@item +The @file{/dev/stdin}, @file{/dev/stdout}, @file{/dev/stderr}, and +@file{/dev/fd/@var{n}} file name interpretation +(@pxref{Special Files, ,Standard I/O Streams}).@refill + +@item +The @samp{-W compat} option to turn off these extensions +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@item +The @samp{-W posix} option for full @sc{posix} compliance +(@pxref{Command Line, ,Invoking @code{awk}}).@refill + +@end itemize + +@node Installation, Gawk Summary, Language History, Top +@chapter Installing @code{gawk} + +This chapter provides instructions for installing @code{gawk} 
on the +various platforms that are supported by the developers. The primary +developers support Unix (and one day, GNU), while the other ports were +contributed. The file @file{ACKNOWLEDGMENT} in the @code{gawk} +distribution lists the electronic mail addresses of the people who did +the respective ports.@refill + +@menu +* Gawk Distribution:: What is in the @code{gawk} distribution. +* Unix Installation:: Installing @code{gawk} under various versions + of Unix. +* VMS Installation:: Installing @code{gawk} on VMS. +* MS-DOS Installation:: Installing @code{gawk} on MS-DOS. +* Atari Installation:: Installing @code{gawk} on the Atari ST. +@end menu + +@node Gawk Distribution, Unix Installation, Installation, Installation +@section The @code{gawk} Distribution + +This section first describes how to get and extract the @code{gawk} +distribution, and then discusses what is in the various files and +subdirectories. + +@menu +* Extracting:: How to get and extract the distribution. +* Distribution contents:: What is in the distribution. +@end menu + +@node Extracting, Distribution contents, Gawk Distribution, Gawk Distribution +@subsection Getting the @code{gawk} Distribution + +@cindex getting gawk +@cindex anonymous ftp +@cindex anonymous uucp +@cindex ftp, anonymous +@cindex uucp, anonymous +@code{gawk} is distributed as a @code{tar} file compressed with the +GNU Zip program, @code{gzip}. You can +get it via anonymous @code{ftp} to the Internet host @code{prep.ai.mit.edu}. +Like all GNU software, it will be archived at other well known systems, +from which it will be possible to use some sort of anonymous @code{uucp} to +obtain the distribution as well. +You can also order @code{gawk} on tape or CD-ROM directly from the +Free Software Foundation. (The address is on the copyright page.) +Doing so directly contributes to the support of the foundation and to +the production of more free software. 
+ +Once you have the distribution (for example, +@file{gawk-2.15.0.tar.z}), first use @code{gzip} to expand the +file, and then use @code{tar} to extract it. You can use the following +pipeline to produce the @code{gawk} distribution: + +@example +# Under System V, add 'o' to the tar flags +gzip -d -c gawk-2.15.0.tar.z | tar -xvpf - +@end example + +@noindent +This will create a directory named @file{gawk-2.15} in the current +directory. + +The distribution file name is of the form @file{gawk-2.15.@var{n}.tar.Z}. +The @var{n} represents a @dfn{patchlevel}, meaning that minor bugs have +been fixed in the major release. The current patchlevel is 0, but when +retrieving distributions, you should get the version with the highest +patchlevel.@refill + +If you are not on a Unix system, you will need to make other arrangements +for getting and extracting the @code{gawk} distribution. You should consult +a local expert. + +@node Distribution contents, , Extracting, Gawk Distribution +@subsection Contents of the @code{gawk} Distribution + +@code{gawk} has a number of C source files, documentation files, +subdirectories and files related to the configuration process +(@pxref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}), +and several subdirectories related to different, non-Unix, +operating systems.@refill + +@table @asis +@item various @samp{.c}, @samp{.y}, and @samp{.h} files + +The C and YACC source files are the actual @code{gawk} source code. +@end table + +@table @file +@item README +@itemx README.VMS +@itemx README.dos +@itemx README.rs6000 +@itemx README.ultrix +Descriptive files: @file{README} for @code{gawk} under Unix, and the +rest for the various hardware and software combinations. + +@item PORTS +A list of systems to which @code{gawk} has been ported, and which +have successfully run the test suite. + +@item ACKNOWLEDGMENT +A list of the people who contributed major parts of the code or documentation. 
+ +@item NEWS +A list of changes to @code{gawk} since the last release or patch. + +@item COPYING +The GNU General Public License. + +@item FUTURES +A brief list of features and/or changes being contemplated for future +releases, with some indication of the time frame for the feature, based +on its difficulty. + +@item LIMITATIONS +A list of those factors that limit @code{gawk}'s performance. +Most of these depend on the hardware or operating system software, and +are not limits in @code{gawk} itself.@refill + +@item PROBLEMS +A file describing known problems with the current release. + +@item gawk.1 +The @code{troff} source for a manual page describing @code{gawk}. + +@item gawk.texinfo +@ifinfo +The @code{texinfo} source file for this Info file. +It should be processed with @TeX{} to produce a printed manual, and +with @code{makeinfo} to produce the Info file.@refill +@end ifinfo +@iftex +The @code{texinfo} source file for this manual. +It should be processed with @TeX{} to produce a printed manual, and +with @code{makeinfo} to produce the Info file.@refill +@end iftex + +@item Makefile.in +@itemx config +@itemx config.in +@itemx configure +@itemx missing +@itemx mungeconf +These files and subdirectories are used when configuring @code{gawk} +for various Unix systems. They are explained in detail in +@ref{Unix Installation, ,Compiling and Installing @code{gawk} on Unix}.@refill + +@item atari +Files needed for building @code{gawk} on an Atari ST. +@xref{Atari Installation, ,Installing @code{gawk} on the Atari ST}, for details. + +@item pc +Files needed for building @code{gawk} under MS-DOS. +@xref{MS-DOS Installation, ,Installing @code{gawk} on MS-DOS}, for details. + +@item vms +Files needed for building @code{gawk} under VMS. +@xref{VMS Installation, ,Compiling Installing and Running @code{gawk} on VMS}, for details. + +@item test +Many interesting @code{awk} programs, provided as a test suite for +@code{gawk}. 
You can use @samp{make test} from the top level @code{gawk} +directory to run your version of @code{gawk} against the test suite. +@c There are many programs here that are useful in their own right. +If @code{gawk} successfully passes @samp{make test} then you can +be confident of a successful port.@refill +@end table + +@node Unix Installation, VMS Installation, Gawk Distribution, Installation +@section Compiling and Installing @code{gawk} on Unix + +Often, you can compile and install @code{gawk} by typing only two +commands. However, if you do not use a supported system, you may need +to configure @code{gawk} for your system yourself. + +@menu +* Quick Installation:: Compiling @code{gawk} on a + supported Unix version. +* Configuration Philosophy:: How it's all supposed to work. +* New Configurations:: What to do if there is no supplied + configuration for your system. +@end menu + +@node Quick Installation, Configuration Philosophy, Unix Installation, Unix Installation +@subsection Compiling @code{gawk} for a Supported Unix Version + +@cindex installation, unix +After you have extracted the @code{gawk} distribution, @code{cd} +to @file{gawk-2.15}. Look in the @file{config} subdirectory for a +file that matches your hardware/software combination. In general, +only the software is relevant; for example @code{sunos41} is used +for SunOS 4.1, on both Sun 3 and Sun 4 hardware.@refill + +If you find such a file, run the command: + +@example +# assume you have SunOS 4.1 +./configure sunos41 +@end example + +This produces a @file{Makefile} and @file{config.h} tailored to your +system. You may wish to edit the @file{Makefile} to use a different +C compiler, such as @code{gcc}, the GNU C compiler, if you have it. 
+You may also wish to change the @code{CFLAGS} variable, which controls +the command line options that are passed to the C compiler (such as +optimization levels, or compiling for debugging).@refill + +After you have configured @file{Makefile} and @file{config.h}, type: + +@example +make +@end example + +@noindent +and shortly thereafter, you should have an executable version of @code{gawk}. +That's all there is to it! + +@node Configuration Philosophy, New Configurations, Quick Installation, Unix Installation +@subsection The Configuration Process + +(This section is of interest only if you know something about using the +C language and the Unix operating system.) + +The source code for @code{gawk} generally attempts to adhere to industry +standards wherever possible. This means that @code{gawk} uses library +routines that are specified by the @sc{ansi} C standard and by the @sc{posix} +operating system interface standard. When using an @sc{ansi} C compiler, +function prototypes are provided to help improve the compile-time checking. + +Many older Unix systems do not support all of either the @sc{ansi} or the +@sc{posix} standards. The @file{missing} subdirectory in the @code{gawk} +distribution contains replacement versions of those subroutines that are +most likely to be missing. + +The @file{config.h} file that is created by the @code{configure} program +contains definitions that describe features of the particular operating +system where you are attempting to compile @code{gawk}. For the most +part, it lists which standard subroutines are @emph{not} available. +For example, if your system lacks the @samp{getopt} routine, then +@samp{GETOPT_MISSING} would be defined. + +@file{config.h} also defines constants that describe facts about your +variant of Unix. For example, there may not be an @samp{st_blksize} +element in the @code{stat} structure. In this case @samp{BLKSIZE_MISSING} +would be defined. 
+ +Based on the list in @file{config.h} of standard subroutines that are +missing, @file{missing.c} will do a @samp{#include} of the appropriate +file(s) from the @file{missing} subdirectory.@refill + +Conditionally compiled code in the other source files relies on the +other definitions in the @file{config.h} file. + +Besides creating @file{config.h}, @code{configure} produces a @file{Makefile} +from @file{Makefile.in}. There are a number of lines in @file{Makefile.in} +that are system or feature specific. For example, there is a line that begins +with @samp{##MAKE_ALLOCA_C##}. This is normally a comment line, since +it starts with @samp{#}. If a configuration file has @samp{MAKE_ALLOCA_C} +in it, then @code{configure} will delete the @samp{##MAKE_ALLOCA_C##} +from the beginning of the line. This will enable the rules in the +@file{Makefile} that use a C version of @samp{alloca}. There are several +similar features that work in this fashion.@refill + +@node New Configurations, , Configuration Philosophy, Unix Installation +@subsection Configuring @code{gawk} for a New System + +(This section is of interest only if you know something about using the +C language and the Unix operating system, and if you have to install +@code{gawk} on a system that is not supported by the @code{gawk} distribution. +If you are a C or Unix novice, get help from a local expert.) + +If you need to configure @code{gawk} for a Unix system that is not +supported in the distribution, first see +@ref{Configuration Philosophy, ,The Configuration Process}. +Then, copy @file{config.in} to @file{config.h}, and copy +@file{Makefile.in} to @file{Makefile}.@refill + +Next, edit both files. Both files are liberally commented, and the +necessary changes should be straightforward. + +While editing @file{config.h}, you need to determine what library +routines you do or do not have by consulting your system documentation, or +by perusing your actual libraries using the @code{ar} or @code{nm} utilities.
+In the worst case, simply do not define @emph{any} of the macros for missing +subroutines. When you compile @code{gawk}, the final link-editing step +will fail. The link editor will provide you with a list of unresolved external +references---these are the missing subroutines. Edit @file{config.h} again +and recompile, and you should be set.@refill + +Editing the @file{Makefile} should also be straightforward. Enable or +disable the lines that begin with @samp{##MAKE_@var{whatever}##}, as +appropriate. Select the correct C compiler and @code{CFLAGS} for it. +Then run @code{make}. + +Getting a correct configuration is likely to be an iterative process. +Do not be discouraged if it takes you several tries. If you have no +luck whatsoever, please report your system type, and the steps you took. +Once you do have a working configuration, please send it to the maintainers +so that support for your system can be added to the official release. + +@xref{Bugs, ,Reporting Problems and Bugs}, for information on how to report +problems in configuring @code{gawk}. You may also use the same mechanisms +for sending in new configurations.@refill + +@node VMS Installation, MS-DOS Installation, Unix Installation, Installation +@section Compiling, Installing, and Running @code{gawk} on VMS + +@c based on material from +@c Pat Rankin <rankin@eql.caltech.edu> + +@cindex installation, vms +This section describes how to compile and install @code{gawk} under VMS. + +@menu +* VMS Compilation:: How to compile @code{gawk} under VMS. +* VMS Installation Details:: How to install @code{gawk} under VMS. +* VMS Running:: How to run @code{gawk} under VMS. +* VMS POSIX:: Alternate instructions for VMS POSIX. 
+@end menu + +@node VMS Compilation, VMS Installation Details, VMS Installation, VMS Installation +@subsection Compiling @code{gawk} under VMS + +To compile @code{gawk} under VMS, there is a @code{DCL} command procedure that +will issue all the necessary @code{CC} and @code{LINK} commands, and there is +also a @file{Makefile} for use with the @code{MMS} utility. From the source +directory, use either + +@smallexample +$ @@[.VMS]VMSBUILD.COM +@end smallexample + +@noindent +or + +@smallexample +$ MMS/DESCRIPTION=[.VMS]DESCRIP.MMS GAWK +@end smallexample + +Depending upon which C compiler you are using, follow one of the sets +of instructions in this table: + +@table @asis +@item VAX C V3.x +Use either @file{vmsbuild.com} or @file{descrip.mms} as is. These use +@code{CC/OPTIMIZE=NOLINE}, which is essential for Version 3.0. + +@item VAX C V2.x +You must have Version 2.3 or 2.4; older ones won't work. Edit either +@file{vmsbuild.com} or @file{descrip.mms} according to the comments in them. +For @file{vmsbuild.com}, this just entails removing two @samp{!} delimiters. +Also edit @file{config.h} (which is a copy of file @file{[.config]vms-conf.h}) +and comment out or delete the two lines @samp{#define __STDC__ 0} and +@samp{#define VAXC_BUILTINS} near the end.@refill + +@item GNU C +Edit @file{vmsbuild.com} or @file{descrip.mms}; the changes are different +from those for VAX C V2.x, but equally straightforward. No changes to +@file{config.h} should be needed. + +@item DEC C +Edit @file{vmsbuild.com} or @file{descrip.mms} according to their comments. +No changes to @file{config.h} should be needed. +@end table + +@code{gawk} 2.15 has been tested under VAX/VMS 5.5-1 using VAX C V3.2, +GNU C 1.40 and 2.3. It should work without modifications for VMS V4.6 and up.
+ +@node VMS Installation Details, VMS Running, VMS Compilation, VMS Installation +@subsection Installing @code{gawk} on VMS + +To install @code{gawk}, all you need is a ``foreign'' command, which is +a @code{DCL} symbol whose value begins with a dollar sign. + +@smallexample +$ GAWK :== $device:[directory]GAWK +@end smallexample + +@noindent +(Substitute the actual location of @code{gawk.exe} for +@samp{device:[directory]}.) The symbol should be placed in the +@file{login.com} of any user who wishes to run @code{gawk}, +so that it will be defined every time the user logs on. +Alternatively, the symbol may be placed in the system-wide +@file{sylogin.com} procedure, which will allow all users +to run @code{gawk}.@refill + +Optionally, the help entry can be loaded into a VMS help library: + +@smallexample +$ LIBRARY/HELP SYS$HELP:HELPLIB [.VMS]GAWK.HLP +@end smallexample + +@noindent +(You may want to substitute a site-specific help library rather than +the standard VMS library @samp{HELPLIB}.) After loading the help text, + +@c this is so tiny, but `should' be smallexample for consistency sake... +@c I didn't because it was so short. --mew 29jan1992 +@example +$ HELP GAWK +@end example + +@noindent +will provide information about both the @code{gawk} implementation and the +@code{awk} programming language. + +The logical name @samp{AWK_LIBRARY} can designate a default location +for @code{awk} program files. For the @samp{-f} option, if the specified +filename has no device or directory path information in it, @code{gawk} +will look in the current directory first, then in the directory specified +by the translation of @samp{AWK_LIBRARY} if the file was not found. +If after searching in both directories, the file still is not found, +then @code{gawk} appends the suffix @samp{.awk} to the filename and the +file search will be re-tried. 
If @samp{AWK_LIBRARY} is not defined, that +portion of the file search will fail benignly.@refill + +@node VMS Running, VMS POSIX, VMS Installation Details, VMS Installation +@subsection Running @code{gawk} on VMS + +Command line parsing and quoting conventions are significantly different +on VMS, so examples in this manual or from other sources often need minor +changes. They @emph{are} minor though, and all @code{awk} programs +should run correctly. + +Here are a couple of trivial tests: + +@smallexample +$ gawk -- "BEGIN @{print ""Hello, World!""@}" +$ gawk -"W" version ! could also be -"W version" or "-W version" +@end smallexample + +@noindent +Note that upper-case and mixed-case text must be quoted. + +The VMS port of @code{gawk} includes a @code{DCL}-style interface in addition +to the original shell-style interface (see the help entry for details). +One side-effect of dual command line parsing is that if there is only a +single parameter (as in the quoted string program above), the command +becomes ambiguous. To work around this, the normally optional @samp{--} +flag is required to force Unix style rather than @code{DCL} parsing. If any +other dash-type options (or multiple parameters such as data files to be +processed) are present, there is no ambiguity and @samp{--} can be omitted. + +The default search path when looking for @code{awk} program files specified +by the @samp{-f} option is @code{"SYS$DISK:[],AWK_LIBRARY:"}. The logical +name @samp{AWKPATH} can be used to override this default. The format +of @samp{AWKPATH} is a comma-separated list of directory specifications. +When defining it, the value should be quoted so that it retains a single +translation, and not a multi-translation @code{RMS} searchlist. + +@node VMS POSIX, , VMS Running, VMS Installation +@subsection Building and using @code{gawk} under VMS POSIX + +Ignore the instructions above, although @file{vms/gawk.hlp} should still +be made available in a help library. 
Make sure that the two scripts, +@file{configure} and @file{mungeconf}, are executable; use @samp{chmod +x} +on them if necessary. Then execute the following commands: + +@smallexample +$ POSIX +psx> configure vms-posix +psx> make awktab.c gawk +@end smallexample + +@noindent +The first command will construct files @file{config.h} and @file{Makefile} +out of templates. The second command will compile and link @code{gawk}. +Due to a @code{make} bug in VMS POSIX V1.0 and V1.1, +the file @file{awktab.c} must be given as an explicit target or it will +not be built and the final link step will fail. Ignore the warning +@samp{"Could not find lib m in lib list"}; it is harmless, caused by the +explicit use of @samp{-lm} as a linker option which is not needed +under VMS POSIX. Under V1.1 (but not V1.0) a problem with the @code{yacc} +skeleton @file{/etc/yyparse.c} will cause a compiler warning for +@file{awktab.c}, followed by a linker warning about compilation warnings +in the resulting object module. These warnings can be ignored.@refill + +Once built, @code{gawk} will work like any other shell utility. Unlike +the normal VMS port of @code{gawk}, no special command line manipulation is +needed in the VMS POSIX environment. + +@node MS-DOS Installation, Atari Installation, VMS Installation, Installation +@section Installing @code{gawk} on MS-DOS + +@cindex installation, ms-dos +The first step is to get all the files in the @code{gawk} distribution +onto your PC. Move all the files from the @file{pc} directory into +the main directory where the other files are. Edit the file +@file{make.bat} so that it will be an acceptable MS-DOS batch file. +This means making sure that all lines are terminated with the ASCII +carriage return and line feed characters. + +@code{gawk} has only been compiled with version 5.1 of the Microsoft +C compiler. The file @file{make.bat} from the @file{pc} directory +assumes that you have this compiler.
+ +Copy the file @file{setargv.obj} from the library directory where it +resides to the @code{gawk} source code directory. + +Run @file{make.bat}. This will compile @code{gawk} for you, and link it. +That's all there is to it! + +@node Atari Installation, , MS-DOS Installation, Installation +@section Installing @code{gawk} on the Atari ST + +@c based on material from +@c Michal Jaegermann <ntomczak@vm.ucs.ualberta.ca> + +@cindex installation, atari +This section assumes that you are running TOS. It applies to other Atari +models (STe, TT) as well. + +In order to use @code{gawk}, you need to have a shell, either text or +graphics, that does not map all the characters of a command line to +upper case. Maintaining case distinction in option flags is very +important (@pxref{Command Line, ,Invoking @code{awk}}). Popular shells +like @code{gulam} or @code{gemini} will work, as will newer versions of +@code{desktop}. Support for I/O redirection is necessary to make it easy +to import @code{awk} programs from other environments. Pipes are nice to have, +but not vital. + +If you have received an executable version of @code{gawk}, place it, +as usual, anywhere in your @code{PATH} where your shell will find it. + +While executing, @code{gawk} creates a number of temporary files. +@code{gawk} looks for either of the environment variables @code{TEMP} +or @code{TMPDIR}, in that order. If either one is found, its value +is assumed to be a directory for temporary files. This directory +must exist, and if you can spare the memory, it is a good idea to +put it on a @sc{ram} drive. If neither @code{TEMP} nor @code{TMPDIR} +are found, then @code{gawk} uses the current directory for its +temporary files. + +The ST version of @code{gawk} searches for its program files as +described in @ref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}. +On the ST, the default value for the @code{AWKPATH} variable is +@code{@w{".,c:\lib\awk,c:\gnu\lib\awk"}}. 
+The search path can be modified by explicitly setting @code{AWKPATH} to +whatever you wish. Note that colons cannot be used on the ST to separate +elements in the @code{AWKPATH} variable, since they have another, reserved, +meaning. Instead, you must use a comma to separate elements in the path. +If you are recompiling @code{gawk} on the ST, then you can choose a new +default search path, by setting the value of @samp{DEFPATH} in the file +@file{...\config\atari}. You may choose a different separator character +by setting the value of @samp{ENVSEP} in the same file. The new values will +be used when creating the header file @file{config.h}.@refill + +@ignore +As a last resort, small +adjustments can be made directly on the executable version of @code{gawk} +using a binary editor.@refill +@end ignore + +Although @code{awk} allows great flexibility in doing I/O redirections +from within a program, this facility should be used with care on the ST. +In some circumstances the OS routines for file handle pool processing +lose track of certain events, causing the computer to crash, and requiring +a reboot. Often a warm reboot is sufficient. Fortunately, this happens +infrequently, and in rather esoteric situations. In particular, avoid +having one part of an @code{awk} program using @code{print} +statements explicitly redirected to @code{"/dev/stdout"}, while other +@code{print} statements use the default standard output, and a +calling shell has redirected standard output to a file.@refill +@c whew! + +When @code{gawk} is compiled with the ST version of @code{gcc} and its +usual libraries, it will accept both @samp{/} and @samp{\} as path separators. +While this is convenient, it should be remembered that this removes one, +technically legal, character (@samp{/}) from your file names, and that +it may create problems for external programs, called via the @code{system()} +function, which may not support this convention. 
Whenever it is possible +that a file created by @code{gawk} will be used by some other program, +use only backslashes. Also remember that in @code{awk}, backslashes in +strings have to be doubled in order to get literal backslashes. + +The initial port of @code{gawk} to the ST was done with @code{gcc}. +If you wish to recompile @code{gawk} from scratch, you will need to use +a compiler that accepts @sc{ansi} standard C (such as @code{gcc}, Turbo C, +or Prospero C). If @code{sizeof(int) != @w{sizeof(int *)}}, the correctness +of the generated code depends heavily on the fact that all function calls +have function prototypes in the current scope. If your compiler does +not accept function prototypes, you will probably have to add a +number of casts to the code.@refill + +If you are using @code{gcc}, make sure that you have up-to-date libraries. +Older versions have problems with some library functions (@code{atan2()}, +@code{strftime()}, the @samp{%g} conversion in @code{sprintf()}) which +may affect the operation of @code{gawk}. + +In the @file{atari} subdirectory of the @code{gawk} distribution is +a version of the @code{system()} function that has been tested with +@code{gulam} and @code{msh}; it should work with other shells as well. +With @code{gulam}, it passes the string to be executed without spawning +an extra copy of a shell. It is possible to replace this version of +@code{system()} with a similar function from a library or from some other +source if that version would be a better choice for the shell you prefer. + +The files needed to recompile @code{gawk} on the ST can be found in +the @file{atari} directory. The provided files and instructions below +assume that you have the GNU C compiler (@code{gcc}), the @code{gulam} shell, +and an ST version of @code{sed}. The @file{Makefile} is set up to use +@file{byacc} as a @file{yacc} replacement. 
With a different set of tools some +adjustments and/or editing will be needed.@refill + +@code{cd} to the @file{atari} directory. Copy @file{Makefile.st} to +@file{makefile} in the source (parent) directory. Possibly adjust +@file{../config/atari} to suit your system. Execute the script @file{mkconf.g} +which will create the header file @file{../config.h}. Go back to the source +directory. If you are not using @code{gcc}, check the file @file{missing.c}. +It may be necessary to change forward slashes in the references to files +from the @file{atari} subdirectory into backslashes. Type @code{make} and +enjoy.@refill + +Compilation with @code{gcc} of some of the bigger modules, like +@file{awk_tab.c}, may require a full four megabytes of memory. On smaller +machines you would need to cut down on optimizations, or you would have to +switch to another, less memory hungry, compiler.@refill + +@node Gawk Summary, Sample Program, Installation, Top +@appendix @code{gawk} Summary + +This appendix provides a brief summary of the @code{gawk} command line and the +@code{awk} language. It is designed to serve as a ``quick reference.'' It is +therefore terse, but complete. + +@menu +* Command Line Summary:: Recapitulation of the command line. +* Language Summary:: A terse review of the language. +* Variables/Fields:: Variables, fields, and arrays. +* Rules Summary:: Patterns and Actions, and their + component parts. +* Functions Summary:: Defining and calling functions. +* Historical Features:: Some undocumented but supported ``features''.
+@end menu + +@node Command Line Summary, Language Summary, Gawk Summary, Gawk Summary +@appendixsec Command Line Options Summary + +The command line consists of options to @code{gawk} itself, the +@code{awk} program text (if not supplied via the @samp{-f} option), and +values to be made available in the @code{ARGC} and @code{ARGV} +predefined @code{awk} variables: + +@example +awk @r{[@var{POSIX or GNU style options}]} -f source-file @r{[@code{--}]} @var{file} @dots{} +awk @r{[@var{POSIX or GNU style options}]} @r{[@code{--}]} '@var{program}' @var{file} @dots{} +@end example + +The options that @code{gawk} accepts are: + +@table @code +@item -F @var{fs} +@itemx --field-separator=@var{fs} +Use @var{fs} for the input field separator (the value of the @code{FS} +predefined variable). + +@item -f @var{program-file} +@itemx --file=@var{program-file} +Read the @code{awk} program source from the file @var{program-file}, instead +of from the first command line argument. + +@item -v @var{var}=@var{val} +@itemx --assign=@var{var}=@var{val} +Assign the variable @var{var} the value @var{val} before program execution +begins. + +@item -W compat +@itemx --compat +Specifies compatibility mode, in which @code{gawk} extensions are turned +off. + +@item -W copyleft +@itemx -W copyright +@itemx --copyleft +@itemx --copyright +Print the short version of the General Public License on the error +output. This option may disappear in a future version of @code{gawk}. + +@item -W help +@itemx -W usage +@itemx --help +@itemx --usage +Print a relatively short summary of the available options on the error output. + +@item -W lint +@itemx --lint +Give warnings about dubious or non-portable @code{awk} constructs. + +@item -W posix +@itemx --posix +Specifies @sc{posix} compatibility mode, in which @code{gawk} extensions +are turned off and additional restrictions apply. 
+ +@item -W source=@var{program-text} +@itemx --source=@var{program-text} +Use @var{program-text} as @code{awk} program source code. This option allows +mixing command line source code with source code from files, and is +particularly useful for mixing command line programs with library functions. + +@item -W version +@itemx --version +Print version information for this particular copy of @code{gawk} on the error +output. This option may disappear in a future version of @code{gawk}. + +@item -- +Signal the end of options. This is useful to allow further arguments to the +@code{awk} program itself to start with a @samp{-}. This is mainly for +consistency with the argument parsing conventions of @sc{posix}. +@end table + +Any other options are flagged as invalid, but are otherwise ignored. +@xref{Command Line, ,Invoking @code{awk}}, for more details. + +@node Language Summary, Variables/Fields, Command Line Summary, Gawk Summary +@appendixsec Language Summary + +An @code{awk} program consists of a sequence of pattern-action statements +and optional function definitions. + +@example +@var{pattern} @{ @var{action statements} @} + +function @var{name}(@var{parameter list}) @{ @var{action statements} @} +@end example + +@code{gawk} first reads the program source from the +@var{program-file}(s) if specified, or from the first non-option +argument on the command line. The @samp{-f} option may be used multiple +times on the command line. @code{gawk} reads the program text from all +the @var{program-file} files, effectively concatenating them in the +order they are specified. This is useful for building libraries of +@code{awk} functions, without having to include them in each new +@code{awk} program that uses them. To use a library function in a file +from a program typed in on the command line, specify @samp{-f /dev/tty}; +then type your program, and end it with a @kbd{Control-d}. 
+@xref{Command Line, ,Invoking @code{awk}}.@refill + +The environment variable @code{AWKPATH} specifies a search path to use +when finding source files named with the @samp{-f} option. The default +path, which is +@samp{.:/usr/lib/awk:/usr/local/lib/awk}, is used if @code{AWKPATH} is not set. +If a file name given to the @samp{-f} option contains a @samp{/} character, +no path search is performed. +@xref{AWKPATH Variable, ,The @code{AWKPATH} Environment Variable}, +for a full description of the @code{AWKPATH} environment variable.@refill + +@code{gawk} compiles the program into an internal form, and then proceeds to +read each file named in the @code{ARGV} array. If there are no files named +on the command line, @code{gawk} reads the standard input. + +If a ``file'' named on the command line has the form +@samp{@var{var}=@var{val}}, it is treated as a variable assignment: the +variable @var{var} is assigned the value @var{val}. +If any of the files have a value that is the null string, that +element in the list is skipped.@refill + +For each line in the input, @code{gawk} tests to see if it matches any +@var{pattern} in the @code{awk} program. For each pattern that the line +matches, the associated @var{action} is executed. + +@node Variables/Fields, Rules Summary, Language Summary, Gawk Summary +@appendixsec Variables and Fields + +@code{awk} variables are dynamic; they come into existence when they are +first used. Their values are either floating-point numbers or strings. +@code{awk} also has one-dimensional arrays; multi-dimensional arrays +may be simulated. There are several predefined variables that +@code{awk} sets as a program runs; these are summarized below. + +@menu +* Fields Summary:: Input field splitting. +* Built-in Summary:: @code{awk}'s built-in variables. +* Arrays Summary:: Using arrays. +* Data Type Summary:: Values in @code{awk} are numbers or strings.
+@end menu + +@node Fields Summary, Built-in Summary, Variables/Fields, Variables/Fields +@appendixsubsec Fields + +As each input line is read, @code{gawk} splits the line into +@var{fields}, using the value of the @code{FS} variable as the field +separator. If @code{FS} is a single character, fields are separated by +that character. Otherwise, @code{FS} is expected to be a full regular +expression. In the special case that @code{FS} is a single blank, +fields are separated by runs of blanks and/or tabs. Note that the value +of @code{IGNORECASE} (@pxref{Case-sensitivity, ,Case-sensitivity in Matching}) +also affects how fields are split when @code{FS} is a regular expression.@refill + +Each field in the input line may be referenced by its position, @code{$1}, +@code{$2}, and so on. @code{$0} is the whole line. The value of a field may +be assigned to as well. Field numbers need not be constants: + +@example +n = 5 +print $n +@end example + +@noindent +prints the fifth field in the input line. The variable @code{NF} is set to +the total number of fields in the input line. + +References to nonexistent fields (i.e., fields after @code{$NF}) return +the null-string. However, assigning to a nonexistent field (e.g., +@code{$(NF+2) = 5}) increases the value of @code{NF}, creates any +intervening fields with the null string as their value, and causes the +value of @code{$0} to be recomputed, with the fields being separated by +the value of @code{OFS}.@refill + +@xref{Reading Files, ,Reading Input Files}, for a full description of the +way @code{awk} defines and uses fields. + +@node Built-in Summary, Arrays Summary, Fields Summary, Variables/Fields +@appendixsubsec Built-in Variables + +@code{awk}'s built-in variables are: + +@table @code +@item ARGC +The number of command line arguments (not including options or the +@code{awk} program itself). + +@item ARGIND +The index in @code{ARGV} of the current file being processed. 
+It is always true that @samp{FILENAME == ARGV[ARGIND]}. + +@item ARGV +The array of command line arguments. The array is indexed from 0 to +@code{ARGC} @minus{} 1. Dynamically changing the contents of @code{ARGV} +can control the files used for data.@refill + +@item CONVFMT +The conversion format to use when converting numbers to strings. + +@item FIELDWIDTHS +A space separated list of numbers describing the fixed-width input data. + +@item ENVIRON +An array containing the values of the environment variables. The array +is indexed by variable name, each element being the value of that +variable. Thus, the environment variable @code{HOME} would be in +@code{ENVIRON["HOME"]}. Its value might be @file{/u/close}. + +Changing this array does not affect the environment seen by programs +which @code{gawk} spawns via redirection or the @code{system} function. +(This may change in a future version of @code{gawk}.) + +Some operating systems do not have environment variables. +The array @code{ENVIRON} is empty when running on these systems. + +@item ERRNO +The system error message when an error occurs using @code{getline} +or @code{close}. + +@item FILENAME +The name of the current input file. If no files are specified on the command +line, the value of @code{FILENAME} is @samp{-}. + +@item FNR +The input record number in the current input file. + +@item FS +The input field separator, a blank by default. + +@item IGNORECASE +The case-sensitivity flag for regular expression operations. If +@code{IGNORECASE} has a nonzero value, then pattern matching in rules, +field splitting with @code{FS}, regular expression matching with +@samp{~} and @samp{!~}, and the @code{gsub}, @code{index}, @code{match}, +@code{split} and @code{sub} predefined functions all ignore case +when doing regular expression operations.@refill + +@item NF +The number of fields in the current input record. + +@item NR +The total number of input records seen so far. 
+ +@item OFMT +The output format for numbers for the @code{print} statement, +@code{"%.6g"} by default. + +@item OFS +The output field separator, a blank by default. + +@item ORS +The output record separator, by default a newline. + +@item RS +The input record separator, by default a newline. @code{RS} is exceptional +in that only the first character of its string value is used for separating +records. If @code{RS} is set to the null string, then records are separated by +blank lines. When @code{RS} is set to the null string, then the newline +character always acts as a field separator, in addition to whatever value +@code{FS} may have.@refill + +@item RSTART +The index of the first character matched by @code{match}; 0 if no match. + +@item RLENGTH +The length of the string matched by @code{match}; @minus{}1 if no match. + +@item SUBSEP +The string used to separate multiple subscripts in array elements, by +default @code{"\034"}. +@end table + +@xref{Built-in Variables}, for more information. + +@node Arrays Summary, Data Type Summary, Built-in Summary, Variables/Fields +@appendixsubsec Arrays + +Arrays are subscripted with an expression between square brackets +(@samp{[} and @samp{]}). Array subscripts are @emph{always} strings; +numbers are converted to strings as necessary, following the standard +conversion rules +(@pxref{Conversion, ,Conversion of Strings and Numbers}).@refill + +If you use multiple expressions separated by commas inside the square +brackets, then the array subscript is a string consisting of the +concatenation of the individual subscript values, converted to strings, +separated by the subscript separator (the value of @code{SUBSEP}). + +The special operator @code{in} may be used in an @code{if} or +@code{while} statement to see if an array has an index consisting of a +particular value. 
+ +@example +if (val in array) + print array[val] +@end example + +If the array has multiple subscripts, use @code{(i, j, @dots{}) in array} +to test for existence of an element. + +The @code{in} construct may also be used in a @code{for} loop to iterate +over all the elements of an array. +@xref{Scanning an Array, ,Scanning all Elements of an Array}.@refill + +An element may be deleted from an array using the @code{delete} statement. + +@xref{Arrays, ,Arrays in @code{awk}}, for more detailed information. + +@node Data Type Summary, , Arrays Summary, Variables/Fields +@appendixsubsec Data Types + +The value of an @code{awk} expression is always either a number +or a string. + +Certain contexts (such as arithmetic operators) require numeric +values. They convert strings to numbers by interpreting the text +of the string as a numeral. If the string does not look like a +numeral, it converts to 0. + +Certain contexts (such as concatenation) require string values. +They convert numbers to strings by effectively printing them +with @code{sprintf}. +@xref{Conversion, ,Conversion of Strings and Numbers}, for the details.@refill + +To force conversion of a string value to a number, simply add 0 +to it. If the value you start with is already a number, this +does not change it. + +To force conversion of a numeric value to a string, concatenate it with +the null string. + +The @code{awk} language defines comparisons as being done numerically if +both operands are numeric, or if one is numeric and the other is a numeric +string. Otherwise one or both operands are converted to strings and a +string comparison is performed. + +Uninitialized variables have the string value @code{""} (the null, or +empty, string). In contexts where a number is required, this is +equivalent to 0. 
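The conversion rules in the Data Types summary can be tried out with a short program. What follows is only a minimal sketch: the wrapper name `conversion_demo` and the variable names are made up for illustration, and it assumes an `awk` that follows the rules described above.

```shell
# Hypothetical helper (not part of gawk): exercises the conversion rules.
conversion_demo() {
    awk 'BEGIN {
        s = "3.0"          # a string constant
        n = s + 0          # adding 0 forces a numeric value
        t = n ""           # concatenating the null string forces a string value
        print n            # prints 3
        print (t == "3")   # string comparison: prints 1
        print (x == 0)     # uninitialized x acts as 0 here: prints 1
        print (x == "")    # and as the null string here: prints 1
        print "abc" + 0    # a non-numeric string converts to 0
    }'
}
conversion_demo
```

The parentheses around each comparison keep `print` from treating `>` or `==` ambiguously, and make the intent explicit.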
+ +@xref{Variables}, for more information on variable naming and initialization; +@pxref{Conversion, ,Conversion of Strings and Numbers}, for more information +on how variable values are interpreted.@refill + +@node Rules Summary, Functions Summary, Variables/Fields, Gawk Summary +@appendixsec Patterns and Actions + +@menu +* Pattern Summary:: Quick overview of patterns. +* Regexp Summary:: Quick overview of regular expressions. +* Actions Summary:: Quick overview of actions. +@end menu + +An @code{awk} program is mostly composed of rules, each consisting of a +pattern followed by an action. The action is enclosed in @samp{@{} and +@samp{@}}. Either the pattern may be missing, or the action may be +missing, but, of course, not both. If the pattern is missing, the +action is executed for every single line of input. A missing action is +equivalent to this action, + +@example +@{ print @} +@end example + +@noindent +which prints the entire line. + +Comments begin with the @samp{#} character, and continue until the end of the +line. Blank lines may be used to separate statements. Normally, a statement +ends with a newline, however, this is not the case for lines ending in a +@samp{,}, @samp{@{}, @samp{?}, @samp{:}, @samp{&&}, or @samp{||}. Lines +ending in @code{do} or @code{else} also have their statements automatically +continued on the following line. In other cases, a line can be continued by +ending it with a @samp{\}, in which case the newline is ignored.@refill + +Multiple statements may be put on one line by separating them with a @samp{;}. +This applies to both the statements within the action part of a rule (the +usual case), and to the rule statements. 
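As an illustration of the rule structure just described, the sketch below combines a rule with no action, a rule with no pattern, a comment, and two statements separated by a semicolon. The wrapper name `rules_demo` and the three input lines are invented for this example.

```shell
# Hypothetical helper (not part of gawk): shows rules with a missing
# action, a missing pattern, comments, and ";" between statements.
rules_demo() {
    printf 'alpha\nbeta\ngamma\n' | awk '
    # A pattern with no action: matching records are printed.
    /^b/
    # An action with no pattern: executed for every record.
    { n++ }
    # Two statements on one line, separated by a semicolon.
    END { print n; print "done" }'
}
rules_demo
```

With the input shown, only `beta` matches the first rule, so the output is `beta`, then the record count `3`, then `done`.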
+ +@xref{Comments, ,Comments in @code{awk} Programs}, for information on +@code{awk}'s commenting convention; +@pxref{Statements/Lines, ,@code{awk} Statements versus Lines}, for a +description of the line continuation mechanism in @code{awk}.@refill + +@node Pattern Summary, Regexp Summary, Rules Summary, Rules Summary +@appendixsubsec Patterns + +@code{awk} patterns may be one of the following: + +@example +/@var{regular expression}/ +@var{relational expression} +@var{pattern} && @var{pattern} +@var{pattern} || @var{pattern} +@var{pattern} ? @var{pattern} : @var{pattern} +(@var{pattern}) +! @var{pattern} +@var{pattern1}, @var{pattern2} +BEGIN +END +@end example + +@code{BEGIN} and @code{END} are two special kinds of patterns that are not +tested against the input. The action parts of all @code{BEGIN} rules are +merged as if all the statements had been written in a single @code{BEGIN} +rule. They are executed before any of the input is read. Similarly, all the +@code{END} rules are merged, and executed when all the input is exhausted (or +when an @code{exit} statement is executed). @code{BEGIN} and @code{END} +patterns cannot be combined with other patterns in pattern expressions. +@code{BEGIN} and @code{END} rules cannot have missing action parts.@refill + +For @samp{/@var{regular-expression}/} patterns, the associated statement is +executed for each input line that matches the regular expression. Regular +expressions are extensions of those in @code{egrep}, and are summarized below. + +A @var{relational expression} may use any of the operators defined below in +the section on actions. These generally test whether certain fields match +certain regular expressions. + +The @samp{&&}, @samp{||}, and @samp{!} operators are logical ``and,'' +logical ``or,'' and logical ``not,'' respectively, as in C. They do +short-circuit evaluation, also as in C, and are used for combining more +primitive pattern expressions. 
As in most languages, parentheses may be +used to change the order of evaluation. + +The @samp{?:} operator is like the same operator in C. If the first +pattern matches, then the second pattern is matched against the input +record; otherwise, the third is matched. Only one of the second and +third patterns is matched. + +The @samp{@var{pattern1}, @var{pattern2}} form of a pattern is called a +range pattern. It matches all input lines starting with a line that +matches @var{pattern1}, and continuing until a line that matches +@var{pattern2}, inclusive. A range pattern cannot be used as an operand +to any of the pattern operators. + +@xref{Patterns}, for a full description of the pattern part of @code{awk} +rules. + +@node Regexp Summary, Actions Summary, Pattern Summary, Rules Summary +@appendixsubsec Regular Expressions + +Regular expressions are the extended kind found in @code{egrep}. +They are composed of characters as follows: + +@table @code +@item @var{c} +matches the character @var{c} (assuming @var{c} is a character with no +special meaning in regexps). + +@item \@var{c} +matches the literal character @var{c}. + +@item . +matches any character except newline. + +@item ^ +matches the beginning of a line or a string. + +@item $ +matches the end of a line or a string. + +@item [@var{abc}@dots{}] +matches any of the characters @var{abc}@dots{} (character class). + +@item [^@var{abc}@dots{}] +matches any character except @var{abc}@dots{} and newline (negated +character class). + +@item @var{r1}|@var{r2} +matches either @var{r1} or @var{r2} (alternation). + +@item @var{r1r2} +matches @var{r1}, and then @var{r2} (concatenation). + +@item @var{r}+ +matches one or more @var{r}'s. + +@item @var{r}* +matches zero or more @var{r}'s. + +@item @var{r}? +matches zero or one @var{r}'s. + +@item (@var{r}) +matches @var{r} (grouping). +@end table + +@xref{Regexp, ,Regular Expressions as Patterns}, for a more detailed +explanation of regular expressions. 
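A few of the regular expression operators summarized in the table above can be seen at work in a small program. This is a sketch only: the wrapper name `regexp_demo` and the sample words are assumptions made for the example.

```shell
# Hypothetical helper (not part of gawk): character classes, grouping,
# the "?" operator, and the "^" and "$" anchors.
regexp_demo() {
    printf 'cat\ncot\ncoat\nct\n' | awk '
    /^c[ao]t$/   { print "class:", $0 }   # "c", then "a" or "o", then "t"
    /^c(oa)?t$/  { print "group:", $0 }   # "c", optional "oa", then "t"
    '
}
regexp_demo
```

The first rule matches `cat` and `cot`; the second matches `coat` and `ct`, because the parenthesized group as a whole is optional.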
+ +The escape sequences allowed in string constants are also valid in +regular expressions (@pxref{Constants, ,Constant Expressions}). + +@node Actions Summary, , Regexp Summary, Rules Summary +@appendixsubsec Actions + +Action statements are enclosed in braces, @samp{@{} and @samp{@}}. +Action statements consist of the usual assignment, conditional, and looping +statements found in most languages. The operators, control statements, +and input/output statements available are patterned after those in C. + +@menu +* Operator Summary:: @code{awk} operators. +* Control Flow Summary:: The control statements. +* I/O Summary:: The I/O statements. +* Printf Summary:: A summary of @code{printf}. +* Special File Summary:: Special file names interpreted internally. +* Numeric Functions Summary:: Built-in numeric functions. +* String Functions Summary:: Built-in string functions. +* Time Functions Summary:: Built-in time functions. +* String Constants Summary:: Escape sequences in strings. +@end menu + +@node Operator Summary, Control Flow Summary, Actions Summary, Actions Summary +@appendixsubsubsec Operators + +The operators in @code{awk}, in order of increasing precedence, are: + +@table @code +@item = += -= *= /= %= ^= +Assignment. Both absolute assignment (@code{@var{var}=@var{value}}) +and operator assignment (the other forms) are supported. + +@item ?: +A conditional expression, as in C. This has the form @code{@var{expr1} ? +@var{expr2} : @var{expr3}}. If @var{expr1} is true, the value of the +expression is @var{expr2}; otherwise it is @var{expr3}. Only one of +@var{expr2} and @var{expr3} is evaluated.@refill + +@item || +Logical ``or''. + +@item && +Logical ``and''. + +@item ~ !~ +Regular expression match, negated match. + +@item < <= > >= != == +The usual relational operators. + +@item @var{blank} +String concatenation. + +@item + - +Addition and subtraction. + +@item * / % +Multiplication, division, and modulus. + +@item + - ! 
+Unary plus, unary minus, and logical negation. + +@item ^ +Exponentiation (@samp{**} may also be used, and @samp{**=} for the assignment +operator, but they are not specified in the @sc{posix} standard). + +@item ++ -- +Increment and decrement, both prefix and postfix. + +@item $ +Field reference. +@end table + +@xref{Expressions, ,Expressions as Action Statements}, for a full +description of all the operators listed above. +@xref{Fields, ,Examining Fields}, for a description of the field +reference operator.@refill + +@node Control Flow Summary, I/O Summary, Operator Summary, Actions Summary +@appendixsubsubsec Control Statements + +The control statements are as follows: + +@example +if (@var{condition}) @var{statement} @r{[} else @var{statement} @r{]} +while (@var{condition}) @var{statement} +do @var{statement} while (@var{condition}) +for (@var{expr1}; @var{expr2}; @var{expr3}) @var{statement} +for (@var{var} in @var{array}) @var{statement} +break +continue +delete @var{array}[@var{index}] +exit @r{[} @var{expression} @r{]} +@{ @var{statements} @} +@end example + +@xref{Statements, ,Control Statements in Actions}, for a full description +of all the control statements listed above. + +@node I/O Summary, Printf Summary, Control Flow Summary, Actions Summary +@appendixsubsubsec I/O Statements + +The input/output statements are as follows: + +@table @code +@item getline +Set @code{$0} from next input record; set @code{NF}, @code{NR}, @code{FNR}. + +@item getline <@var{file} +Set @code{$0} from next record of @var{file}; set @code{NF}. + +@item getline @var{var} +Set @var{var} from next input record; set @code{NF}, @code{FNR}. + +@item getline @var{var} <@var{file} +Set @var{var} from next record of @var{file}. + +@item next +Stop processing the current input record. The next input record is read and +processing starts over with the first pattern in the @code{awk} program. +If the end of the input data is reached, the @code{END} rule(s), if any, +are executed. 
+ +@item next file +Stop processing the current input file. The next input record read comes +from the next input file. @code{FILENAME} is updated, @code{FNR} is set to 1, +and processing starts over with the first pattern in the @code{awk} program. +If the end of the input data is reached, the @code{END} rule(s), if any, +are executed. + +@item print +Prints the current record. + +@item print @var{expr-list} +Prints expressions. + +@item print @var{expr-list} > @var{file} +Prints expressions on @var{file}. + +@item printf @var{fmt, expr-list} +Format and print. + +@item printf @var{fmt, expr-list} > @var{file} +Format and print on @var{file}. +@end table + +Other input/output redirections are also allowed. For @code{print} and +@code{printf}, @samp{>> @var{file}} appends output to the @var{file}, +and @samp{| @var{command}} writes on a pipe. In a similar fashion, +@samp{@var{command} | getline} pipes input into @code{getline}. +@code{getline} returns 0 on end of file, and @minus{}1 on an error.@refill + +@xref{Getline, ,Explicit Input with @code{getline}}, for a full description +of the @code{getline} statement. +@xref{Printing, ,Printing Output}, for a full description of @code{print} and +@code{printf}. Finally, @pxref{Next Statement, ,The @code{next} Statement}, +for a description of how the @code{next} statement works.@refill + +@node Printf Summary, Special File Summary, I/O Summary, Actions Summary +@appendixsubsubsec @code{printf} Summary + +The @code{awk} @code{printf} statement and @code{sprintf} function +accept the following conversion specification formats: + +@table @code +@item %c +An ASCII character. If the argument used for @samp{%c} is numeric, it is +treated as a character and printed. Otherwise, the argument is assumed to +be a string, and only the first character of that string is printed. + +@item %d +@itemx %i +A decimal number (the integer part).
+ +@item %e +A floating point number of the form +@samp{@r{[}-@r{]}d.ddddddE@r{[}+-@r{]}dd}.@refill + +@item %f +A floating point number of the form +@r{[}@code{-}@r{]}@code{ddd.dddddd}. + +@item %g +Use @samp{%e} or @samp{%f} conversion, whichever produces a shorter string, +with nonsignificant zeros suppressed. + +@item %o +An unsigned octal number (again, an integer). + +@item %s +A character string. + +@item %x +An unsigned hexadecimal number (an integer). + +@item %X +Like @samp{%x}, except use @samp{A} through @samp{F} instead of @samp{a} +through @samp{f} for decimal 10 through 15.@refill + +@item %% +A single @samp{%} character; no argument is converted. +@end table + +There are optional, additional parameters that may lie between the @samp{%} +and the control letter: + +@table @code +@item - +The expression should be left-justified within its field. + +@item @var{width} +The field should be padded to this width. If @var{width} has a leading zero, +then the field is padded with zeros. Otherwise it is padded with blanks. + +@item .@var{prec} +A number indicating the maximum width of strings or digits to the right +of the decimal point. +@end table + +Either or both of the @var{width} and @var{prec} values may be specified +as @samp{*}. In that case, the particular value is taken from the argument +list. + +@xref{Printf, ,Using @code{printf} Statements for Fancier Printing}, for +examples and for a more detailed description. + +@node Special File Summary, Numeric Functions Summary, Printf Summary, Actions Summary +@appendixsubsubsec Special File Names + +When doing I/O redirection from either @code{print} or @code{printf} into a +file, or via @code{getline} from a file, @code{gawk} recognizes certain special +file names internally. These file names allow access to open file descriptors +inherited from @code{gawk}'s parent process (usually the shell). The +file names are: + +@table @file +@item /dev/stdin +The standard input. 
+ +@item /dev/stdout +The standard output. + +@item /dev/stderr +The standard error output. + +@item /dev/fd/@var{n} +The file denoted by the open file descriptor @var{n}. +@end table + +In addition the following files provide process related information +about the running @code{gawk} program. + +@table @file +@item /dev/pid +Reading this file returns the process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/ppid +Reading this file returns the parent process ID of the current process, +in decimal, terminated with a newline. + +@item /dev/pgrpid +Reading this file returns the process group ID of the current process, +in decimal, terminated with a newline. + +@item /dev/user +Reading this file returns a single record terminated with a newline. +The fields are separated with blanks. The fields represent the +following information: + +@table @code +@item $1 +The value of the @code{getuid} system call. + +@item $2 +The value of the @code{geteuid} system call. + +@item $3 +The value of the @code{getgid} system call. + +@item $4 +The value of the @code{getegid} system call. +@end table + +If there are any additional fields, they are the group IDs returned by +@code{getgroups} system call. +(Multiple groups may not be supported on all systems.)@refill +@end table + +@noindent +These file names may also be used on the command line to name data files. +These file names are only recognized internally if you do not +actually have files by these names on your system. + +@xref{Special Files, ,Standard I/O Streams}, for a longer description that +provides the motivation for this feature. + +@node Numeric Functions Summary, String Functions Summary, Special File Summary, Actions Summary +@appendixsubsubsec Numeric Functions + +@code{awk} has the following predefined arithmetic functions: + +@table @code +@item atan2(@var{y}, @var{x}) +returns the arctangent of @var{y/x} in radians. + +@item cos(@var{expr}) +returns the cosine in radians. 
+ +@item exp(@var{expr}) +the exponential function. + +@item int(@var{expr}) +truncates to integer. + +@item log(@var{expr}) +the natural logarithm function. + +@item rand() +returns a random number between 0 and 1. + +@item sin(@var{expr}) +returns the sine in radians. + +@item sqrt(@var{expr}) +the square root function. + +@item srand(@var{expr}) +use @var{expr} as a new seed for the random number generator. If no @var{expr} +is provided, the time of day is used. The return value is the previous +seed for the random number generator. +@end table + +@node String Functions Summary, Time Functions Summary, Numeric Functions Summary, Actions Summary +@appendixsubsubsec String Functions + +@code{awk} has the following predefined string functions: + +@table @code +@item gsub(@var{r}, @var{s}, @var{t}) +for each substring matching the regular expression @var{r} in the string +@var{t}, substitute the string @var{s}, and return the number of substitutions. +If @var{t} is not supplied, use @code{$0}. + +@item index(@var{s}, @var{t}) +returns the index of the string @var{t} in the string @var{s}, or 0 if +@var{t} is not present. + +@item length(@var{s}) +returns the length of the string @var{s}. The length of @code{$0} +is returned if no argument is supplied. + +@item match(@var{s}, @var{r}) +returns the position in @var{s} where the regular expression @var{r} +occurs, or 0 if @var{r} is not present, and sets the values of @code{RSTART} +and @code{RLENGTH}. + +@item split(@var{s}, @var{a}, @var{r}) +splits the string @var{s} into the array @var{a} on the regular expression +@var{r}, and returns the number of fields. If @var{r} is omitted, @code{FS} +is used instead. + +@item sprintf(@var{fmt}, @var{expr-list}) +prints @var{expr-list} according to @var{fmt}, and returns the resulting string. + +@item sub(@var{r}, @var{s}, @var{t}) +this is just like @code{gsub}, but only the first matching substring is +replaced. 
+ +@item substr(@var{s}, @var{i}, @var{n}) +returns the @var{n}-character substring of @var{s} starting at @var{i}. +If @var{n} is omitted, the rest of @var{s} is used. + +@item tolower(@var{str}) +returns a copy of the string @var{str}, with all the upper-case characters in +@var{str} translated to their corresponding lower-case counterparts. +Nonalphabetic characters are left unchanged. + +@item toupper(@var{str}) +returns a copy of the string @var{str}, with all the lower-case characters in +@var{str} translated to their corresponding upper-case counterparts. +Nonalphabetic characters are left unchanged. + +@item system(@var{cmd-line}) +Execute the command @var{cmd-line}, and return the exit status. +@end table + +@node Time Functions Summary, String Constants Summary, String Functions Summary, Actions Summary +@appendixsubsubsec Built-in time functions + +The following two functions are available for getting the current +time of day, and for formatting time stamps. + +@table @code +@item systime() +returns the current time of day as the number of seconds since a particular +epoch (Midnight, January 1, 1970 @sc{utc}, on @sc{posix} systems). + +@item strftime(@var{format}, @var{timestamp}) +formats @var{timestamp} according to the specification in @var{format}. +The current time of day is used if no @var{timestamp} is supplied. +@xref{Time Functions, ,Functions for Dealing with Time Stamps}, for the +details on the conversion specifiers that @code{strftime} accepts.@refill +@end table + +@iftex +@xref{Built-in, ,Built-in Functions}, for a description of all of +@code{awk}'s built-in functions. +@end iftex + +@node String Constants Summary, , Time Functions Summary, Actions Summary +@appendixsubsubsec String Constants + +String constants in @code{awk} are sequences of characters enclosed +between double quotes (@code{"}). Within strings, certain @dfn{escape sequences} +are recognized, as in C. These are: + +@table @code +@item \\ +A literal backslash. 
+ +@item \a +The ``alert'' character; usually the ASCII BEL character. + +@item \b +Backspace. + +@item \f +Formfeed. + +@item \n +Newline. + +@item \r +Carriage return. + +@item \t +Horizontal tab. + +@item \v +Vertical tab. + +@item \x@var{hex digits} +The character represented by the string of hexadecimal digits following +the @samp{\x}. As in @sc{ansi} C, all following hexadecimal digits are +considered part of the escape sequence. (This feature should tell us +something about language design by committee.) E.g., @code{"\x1B"} is a +string containing the ASCII ESC (escape) character. (The @samp{\x} +escape sequence is not in @sc{posix} @code{awk}.) + +@item \@var{ddd} +The character represented by the 1-, 2-, or 3-digit sequence of octal +digits. Thus, @code{"\033"} is also a string containing the ASCII ESC +(escape) character. + +@item \@var{c} +The literal character @var{c}. +@end table + +The escape sequences may also be used inside constant regular expressions +(e.g., the regexp @code{@w{/[@ \t\f\n\r\v]/}} matches whitespace +characters).@refill + +@xref{Constants, ,Constant Expressions}. + +@node Functions Summary, Historical Features, Rules Summary, Gawk Summary +@appendixsec Functions + +Functions in @code{awk} are defined as follows: + +@example +function @var{name}(@var{parameter list}) @{ @var{statements} @} +@end example + +Actual parameters supplied in the function call are used to instantiate +the formal parameters declared in the function. Arrays are passed by +reference, other variables are passed by value. + +If there are fewer arguments passed than there are names in @var{parameter-list}, +the extra names are given the null string as value. Extra names have the +effect of local variables. + +The open-parenthesis in a function call of a user-defined function must +immediately follow the function name, without any intervening white space. +This is to avoid a syntactic ambiguity with the concatenation operator. 
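The parameter-passing rules above can be illustrated with a short program. This is a hedged sketch: the function `fill`, the wrapper `function_demo`, and all variable names are hypothetical. The extra spaces before `i` in the parameter list are only a common convention for marking extra names used as local variables.

```shell
# Hypothetical helper (not part of gawk): arrays are passed by reference,
# scalars by value, and extra formal parameters act as local variables.
function_demo() {
    awk '
    # "i" is an extra formal parameter, used here as a local variable.
    function fill(arr, n,    i) {
        for (i = 1; i <= n; i++)
            arr[i] = i * i     # seen by the caller: arrays go by reference
        n = 0                  # not seen by the caller: scalars go by value
    }
    BEGIN {
        count = 3
        i = "untouched"        # a global with the same name as the local
        split("", squares)     # make squares an empty array before the call
        fill(squares, count)   # no space between the name and "("
        print squares[3], count, i
    }'
}
function_demo
```

The output is `9 3 untouched`: the array was filled in by the function, while the caller's `count` and the global `i` kept their values.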
+ +The word @code{func} may be used in place of @code{function} (but not in +@sc{posix} @code{awk}). + +Use the @code{return} statement to return a value from a function. + +@xref{User-defined, ,User-defined Functions}, for a more complete description. + +@node Historical Features, , Functions Summary, Gawk Summary +@appendixsec Historical Features + +There are two features of historical @code{awk} implementations that +@code{gawk} supports. First, it is possible to call the @code{length} +built-in function not only with no arguments, but even without parentheses! + +@example +a = length +@end example + +@noindent +is the same as either of + +@example +a = length() +a = length($0) +@end example + +@noindent +This feature is marked as ``deprecated'' in the @sc{posix} standard, and +@code{gawk} will issue a warning about its use if @samp{-W lint} is +specified on the command line. + +The other feature is the use of the @code{continue} statement outside the +body of a @code{while}, @code{for}, or @code{do} loop. Traditional +@code{awk} implementations have treated such usage as equivalent to the +@code{next} statement. @code{gawk} will support this usage if @samp{-W posix} +has not been specified. + +@node Sample Program, Bugs, Gawk Summary, Top +@appendix Sample Program + +The following example is a complete @code{awk} program, which prints +the number of occurrences of each word in its input. It illustrates the +associative nature of @code{awk} arrays by using strings as subscripts. It +also demonstrates the @samp{for @var{x} in @var{array}} construction. +Finally, it shows how @code{awk} can be used in conjunction with other +utility programs to do a useful task of some complexity with a minimum of +effort. 
Some explanations follow the program listing.@refill + +@example +awk ' +# Print list of word frequencies +@{ + for (i = 1; i <= NF; i++) + freq[$i]++ +@} + +END @{ + for (word in freq) + printf "%s\t%d\n", word, freq[word] +@}' +@end example + +The first thing to notice about this program is that it has two rules. The +first rule, because it has an empty pattern, is executed on every line of +the input. It uses @code{awk}'s field-accessing mechanism +(@pxref{Fields, ,Examining Fields}) to pick out the individual words from +the line, and the built-in variable @code{NF} (@pxref{Built-in Variables}) +to know how many fields are available.@refill + +For each input word, an element of the array @code{freq} is incremented to +reflect that the word has been seen an additional time.@refill + +The second rule, because it has the pattern @code{END}, is not executed +until the input has been exhausted. It prints out the contents of the +@code{freq} table that has been built up inside the first action.@refill + +Note that this program has several problems that would prevent it from being +useful by itself on real text files:@refill + +@itemize @bullet +@item +Words are detected using the @code{awk} convention that fields are +separated by whitespace and that other characters in the input (except +newlines) don't have any special meaning to @code{awk}. This means that +punctuation characters count as part of words.@refill + +@item +The @code{awk} language considers upper and lower case characters to be +distinct. Therefore, @samp{foo} and @samp{Foo} are not treated by this +program as the same word. This is undesirable since in normal text, words +are capitalized if they begin sentences, and a frequency analyzer should not +be sensitive to that.@refill + +@item +The output does not come out in any useful order. 
You're more likely to be +interested in which words occur most frequently, or having an alphabetized +table of how frequently each word occurs.@refill +@end itemize + +The way to solve these problems is to use some of the more advanced +features of the @code{awk} language. First, we use @code{tolower} to remove +case distinctions. Next, we use @code{gsub} to remove punctuation +characters. Finally, we use the system @code{sort} utility to process the +output of the @code{awk} script. First, here is the new version of +the program:@refill + +@example +awk ' +# Print list of word frequencies +@{ + $0 = tolower($0) # remove case distinctions + gsub(/[^a-z0-9_ \t]/, "", $0) # remove punctuation + for (i = 1; i <= NF; i++) + freq[$i]++ +@} + +END @{ + for (word in freq) + printf "%s\t%d\n", word, freq[word] +@}' +@end example + +Assuming we have saved this program in a file named @file{frequency.awk}, +and that the data is in @file{file1}, the following pipeline + +@example +awk -f frequency.awk file1 | sort +1 -nr +@end example + +@noindent +produces a table of the words appearing in @file{file1} in order of +decreasing frequency. + +The @code{awk} program suitably massages the data and produces a word +frequency table, which is not ordered. + +The @code{awk} script's output is then sorted by the @code{sort} command and +printed on the terminal. 
The options given to @code{sort} in this example +specify to sort using the second field of each input line (skipping one field), +that the sort keys should be treated as numeric quantities (otherwise +@samp{15} would come before @samp{5}), and that the sorting should be done +in descending (reverse) order.@refill + +We could have even done the @code{sort} from within the program, by +changing the @code{END} action to: + +@example +END @{ + sort = "sort +1 -nr" + for (word in freq) + printf "%s\t%d\n", word, freq[word] | sort + close(sort) +@}' +@end example + +See the general operating system documentation for more information on how +to use the @code{sort} command.@refill + +@ignore +@strong{ADR: I have some more substantial programs courtesy of Rick Adams +at UUNET. I am planning on incorporating those either in addition to or +instead of this program.} + +@strong{I would also like to incorporate the general @code{translate} +function that I have written.} + +@strong{I have a ton of other sample programs to include too.} +@end ignore + +@node Bugs, Notes, Sample Program, Top +@appendix Reporting Problems and Bugs + +@c This chapter stolen shamelessly from the GNU m4 manual. +@c This chapter has been unshamelessly altered to emulate changes made to +@c make.texi from whence it was originally shamelessly stolen! :-} --mew + +If you have problems with @code{gawk} or think that you have found a bug, +please report it to the developers; we cannot promise to do anything +but we might well want to fix it. + +Before reporting a bug, make sure you have actually found a real bug. +Carefully reread the documentation and see if it really says you can do +what you're trying to do. If it's not clear whether you should be able +to do something or not, report that too; it's a bug in the documentation! + +Before reporting a bug or trying to fix it yourself, try to isolate it +to the smallest possible @code{awk} program and input data file that +reproduces the problem. 
Then send us the program and data file, +some idea of what kind of Unix system you're using, and the exact results +@code{gawk} gave you. Also say what you expected to occur; this will help +us decide whether the problem was really in the documentation. + +Once you have a precise problem, send e-mail to (Internet) +@samp{bug-gnu-utils@@prep.ai.mit.edu} or (UUCP) +@samp{mit-eddie!prep.ai.mit.edu!bug-gnu-utils}. Please include the +version number of @code{gawk} you are using. You can get this information +with the command @samp{gawk -W version '@{@}' /dev/null}. +You should send carbon copies of your mail to David Trueman at +@samp{david@@cs.dal.ca}, and to Arnold Robbins, who can be reached at +@samp{arnold@@skeeve.atl.ga.us}. David is most likely to fix code +problems, while Arnold is most likely to fix documentation problems.@refill + +Non-bug suggestions are always welcome as well. If you have questions +about things that are unclear in the documentation or are just obscure +features, ask Arnold Robbins; he will try to help you out, although he +may not have the time to fix the problem. You can send him electronic mail at the Internet address +above. + +If you find bugs in one of the non-Unix ports of @code{gawk}, please send +an electronic mail message to the person who maintains that port. They +are listed below, and also in the @file{README} file in the @code{gawk} +distribution. Information in the @code{README} file should be considered +authoritative if it conflicts with this manual. + +The people maintaining the non-Unix ports of @code{gawk} are: + +@table @asis +@item MS-DOS +The port to MS-DOS is maintained by Scott Deifik. +His electronic mail address is @samp{scottd@@amgen.com}. + +@item VMS +The port to VAX VMS is maintained by Pat Rankin. +His electronic mail address is @samp{rankin@@eql.caltech.edu}. + +@item Atari ST +The port to the Atari ST is maintained by Michal Jaegermann. +His electronic mail address is @samp{ntomczak@@vm.ucs.ualberta.ca}. 
+ +@end table + +If your bug is also reproducible under Unix, please send copies of your +report to the general GNU bug list, as well as to Arnold Robbins and David +Trueman, at the addresses listed above. + +@node Notes, Glossary, Bugs, Top +@appendix Implementation Notes + +This appendix contains information mainly of interest to implementors and +maintainers of @code{gawk}. Everything in it applies specifically to +@code{gawk}, and not to other implementations. + +@menu +* Compatibility Mode:: How to disable certain @code{gawk} extensions. +* Future Extensions:: New features we may implement soon. +* Improvements:: Suggestions for improvements by volunteers. +@end menu + +@node Compatibility Mode, Future Extensions, Notes, Notes +@appendixsec Downward Compatibility and Debugging + +@xref{POSIX/GNU, ,Extensions in @code{gawk} not in POSIX @code{awk}}, +for a summary of the GNU extensions to the @code{awk} language and program. +All of these features can be turned off by invoking @code{gawk} with the +@samp{-W compat} option, or with the @samp{-W posix} option.@refill + +If @code{gawk} is compiled for debugging with @samp{-DDEBUG}, then there +is one more option available on the command line: + +@table @samp +@item -W parsedebug +Print out the parse stack information as the program is being parsed. +@end table + +This option is intended only for serious @code{gawk} developers, +and not for the casual user. It probably has not even been compiled into +your version of @code{gawk}, since it slows down execution. + +@node Future Extensions, Improvements, Compatibility Mode, Notes +@appendixsec Probable Future Extensions + +This section briefly lists extensions that indicate the directions we are +currently considering for @code{gawk}. The file @file{FUTURES} in the +@code{gawk} distribution lists these extensions, as well as several others. + +@table @asis +@item @code{RS} as a regexp +The meaning of @code{RS} may be generalized along the lines of @code{FS}. 
+ +@item Control of subprocess environment +Changes made in @code{gawk} to the array @code{ENVIRON} may be +propagated to subprocesses run by @code{gawk}. + +@item Databases +It may be possible to map a GDBM/NDBM/SDBM file into an @code{awk} array. + +@item Single-character fields +The null string, @code{""}, as a field separator, will cause field +splitting and the @code{split} function to separate individual characters. +Thus, @code{split("abcd", a, "")} would yield @code{a[1] == "a"}, +@code{a[2] == "b"}, and so on. + +@item More @code{lint} warnings +There are more things that could be checked for portability. + +@item @code{RECLEN} variable for fixed length records +Along with @code{FIELDWIDTHS}, this would speed up the processing of +fixed-length records. + +@item @code{RT} variable to hold the record terminator +It is occasionally useful to have access to the actual string of +characters that matched the @code{RS} variable. The @code{RT} +variable would hold these characters. + +@item A @code{restart} keyword +After modifying @code{$0}, @code{restart} would restart the pattern +matching loop, without reading a new record from the input. + +@item A @samp{|&} redirection +The @samp{|&} redirection, in place of @samp{|}, would open a two-way +pipeline for communication with a sub-process (via @code{getline}, +@code{print}, and @code{printf}). + +@item @code{IGNORECASE} affecting all comparisons +The effects of the @code{IGNORECASE} variable may be generalized to +all string comparisons, and not just regular expression operations. + +@item A way to mix command line source code and library files +There may be a new option that would make it possible to easily use library +functions from a program entered on the command line. +@c probably a @samp{-s} option... + +@item GNU-style long options +We will add GNU-style long options +to @code{gawk} for compatibility with other GNU programs. 
+(For example, @samp{--field-separator=:} would be equivalent to +@samp{-F:}.)@refill + +@c this is @emph{very} long term --- not worth including right now. +@ignore +@item The C Comma Operator +We may add the C comma operator, which takes the form +@code{@var{expr1},@var{expr2}}. The first expression is evaluated, and the +result is thrown away. The value of the full expression is the value of +@var{expr2}.@refill +@end ignore +@end table + +@node Improvements, , Future Extensions, Notes +@appendixsec Suggestions for Improvements + +Here are some projects that would-be @code{gawk} hackers might like to take +on. They vary in size from a few days to a few weeks of programming, +depending on which one you choose and how fast a programmer you are. Please +send any improvements you write to the maintainers at the GNU +project.@refill + +@enumerate +@item +Compilation of @code{awk} programs: @code{gawk} uses a Bison (YACC-like) +parser to convert the script given it into a syntax tree; the syntax +tree is then executed by a simple recursive evaluator. This method incurs +a lot of overhead, since the recursive evaluator performs many procedure +calls to do even the simplest things.@refill + +It should be possible for @code{gawk} to convert the script's parse tree +into a C program which the user would then compile, using the normal +C compiler and a special @code{gawk} library to provide all the needed +functions (regexps, fields, associative arrays, type coercion, and so +on).@refill + +An easier possibility might be for an intermediate phase of @code{awk} to +convert the parse tree into a linear byte code form like the one used +in GNU Emacs Lisp. The recursive evaluator would then be replaced by +a straight line byte code interpreter that would be intermediate in speed +between running a compiled program and doing what @code{gawk} does +now.@refill + +This may actually happen for the 3.0 version of @code{gawk}. 
+ +@item +An error message section has not been included in this version of the +manual. Perhaps some nice beta testers will document some of the messages +for the future. + +@item +The programs in the test suite could use documenting in this manual. + +@item +The programs and data files in the manual should be available in +separate files to facilitate experimentation. + +@item +See the @file{FUTURES} file for more ideas. Contact us if you would +seriously like to tackle any of the items listed there. +@end enumerate + +@node Glossary, Index, Notes, Top +@appendix Glossary + +@table @asis +@item Action +A series of @code{awk} statements attached to a rule. If the rule's +pattern matches an input record, the @code{awk} language executes the +rule's action. Actions are always enclosed in curly braces. +@xref{Actions, ,Overview of Actions}.@refill + +@item Amazing @code{awk} Assembler +Henry Spencer at the University of Toronto wrote a retargetable assembler +completely as @code{awk} scripts. It is thousands of lines long, including +machine descriptions for several 8-bit microcomputers. +@c It is distributed with @code{gawk} (as part of the test suite) and +It is a good example of a +program that would have been better written in another language.@refill + +@item @sc{ansi} +The American National Standards Institute. This organization produces +many standards, among them the standard for the C programming language. + +@item Assignment +An @code{awk} expression that changes the value of some @code{awk} +variable or data object. An object that you can assign to is called an +@dfn{lvalue}. @xref{Assignment Ops, ,Assignment Expressions}.@refill + +@item @code{awk} Language +The language in which @code{awk} programs are written. + +@item @code{awk} Program +An @code{awk} program consists of a series of @dfn{patterns} and +@dfn{actions}, collectively known as @dfn{rules}. For each input record +given to the program, the program's rules are all processed in turn. 
+@code{awk} programs may also contain function definitions.@refill + +@item @code{awk} Script +Another name for an @code{awk} program. + +@item Built-in Function +The @code{awk} language provides built-in functions that perform various +numerical, time stamp related, and string computations. Examples are +@code{sqrt} (for the square root of a number) and @code{substr} (for a +substring of a string). @xref{Built-in, ,Built-in Functions}.@refill + +@item Built-in Variable +@code{ARGC}, @code{ARGIND}, @code{ARGV}, @code{CONVFMT}, @code{ENVIRON}, +@code{ERRNO}, @code{FIELDWIDTHS}, @code{FILENAME}, @code{FNR}, @code{FS}, +@code{IGNORECASE}, @code{NF}, @code{NR}, @code{OFMT}, @code{OFS}, @code{ORS}, +@code{RLENGTH}, @code{RSTART}, @code{RS}, and @code{SUBSEP}, +are the variables that have special +meaning to @code{awk}. Changing some of them affects @code{awk}'s running +environment. @xref{Built-in Variables}.@refill + +@item Braces +See ``Curly Braces.'' + +@item C +The system programming language that most GNU software is written in. The +@code{awk} programming language has C-like syntax, and this manual +points out similarities between @code{awk} and C when appropriate.@refill + +@item CHEM +A preprocessor for @code{pic} that reads descriptions of molecules +and produces @code{pic} input for drawing them. It was written by +Brian Kernighan, and is available from @code{netlib@@research.att.com}.@refill + +@item Compound Statement +A series of @code{awk} statements, enclosed in curly braces. Compound +statements may be nested. +@xref{Statements, ,Control Statements in Actions}.@refill + +@item Concatenation +Concatenating two strings means sticking them together, one after another, +giving a new string. For example, the string @samp{foo} concatenated with +the string @samp{bar} gives the string @samp{foobar}. 
+@xref{Concatenation, ,String Concatenation}.@refill + +@item Conditional Expression +An expression using the @samp{?:} ternary operator, such as +@code{@var{expr1} ? @var{expr2} : @var{expr3}}. The expression +@var{expr1} is evaluated; if the result is true, the value of the whole +expression is the value of @var{expr2}; otherwise the value is +@var{expr3}. In either case, only one of @var{expr2} and @var{expr3} +is evaluated. @xref{Conditional Exp, ,Conditional Expressions}.@refill + +@item Constant Regular Expression +A constant regular expression is a regular expression written within +slashes, such as @samp{/foo/}. This regular expression is chosen +when you write the @code{awk} program, and cannot be changed during +its execution. @xref{Regexp Usage, ,How to Use Regular Expressions}. + +@item Comparison Expression +A relation that is either true or false, such as @code{(a < b)}. +Comparison expressions are used in @code{if}, @code{while}, and @code{for} +statements, and in patterns to select which input records to process. +@xref{Comparison Ops, ,Comparison Expressions}.@refill + +@item Curly Braces +The characters @samp{@{} and @samp{@}}. Curly braces are used in +@code{awk} for delimiting actions, compound statements, and function +bodies.@refill + +@item Data Objects +These are numbers and strings of characters. Numbers are converted into +strings and vice versa, as needed. +@xref{Conversion, ,Conversion of Strings and Numbers}.@refill + +@item Dynamic Regular Expression +A dynamic regular expression is a regular expression written as an +ordinary expression. It could be a string constant, such as +@code{"foo"}, but it may also be an expression whose value may vary. +@xref{Regexp Usage, ,How to Use Regular Expressions}. + +@item Escape Sequences +A special sequence of characters used for describing nonprinting +characters, such as @samp{\n} for newline, or @samp{\033} for the ASCII +ESC (escape) character. @xref{Constants, ,Constant Expressions}. 
+ +@item Field +When @code{awk} reads an input record, it splits the record into pieces +separated by whitespace (or by a separator regexp which you can +change by setting the built-in variable @code{FS}). Such pieces are +called fields. If the pieces are of fixed length, you can use the built-in +variable @code{FIELDWIDTHS} to describe their lengths. +@xref{Records, ,How Input is Split into Records}.@refill + +@item Format +Format strings are used to control the appearance of output in the +@code{printf} statement. Also, data conversions from numbers to strings +are controlled by the format string contained in the built-in variable +@code{CONVFMT}. @xref{Control Letters, ,Format-Control Letters}.@refill + +@item Function +A specialized group of statements often used to encapsulate general +or program-specific tasks. @code{awk} has a number of built-in +functions, and also allows you to define your own. +@xref{Built-in, ,Built-in Functions}. +Also, see @ref{User-defined, ,User-defined Functions}.@refill + +@item @code{gawk} +The GNU implementation of @code{awk}. + +@item GNU +``GNU's not Unix''. An on-going project of the Free Software Foundation +to create a complete, freely distributable, @sc{posix}-compliant computing +environment. + +@item Input Record +A single chunk of data read in by @code{awk}. Usually, an @code{awk} input +record consists of one line of text. +@xref{Records, ,How Input is Split into Records}.@refill + +@item Keyword +In the @code{awk} language, a keyword is a word that has special +meaning. Keywords are reserved and may not be used as variable names. + +@code{awk}'s keywords are: +@code{if}, +@code{else}, +@code{while}, +@code{do@dots{}while}, +@code{for}, +@code{for@dots{}in}, +@code{break}, +@code{continue}, +@code{delete}, +@code{next}, +@code{function}, +@code{func}, +and @code{exit}.@refill + +@item Lvalue +An expression that can appear on the left side of an assignment +operator. 
In most languages, lvalues can be variables or array +elements. In @code{awk}, a field designator can also be used as an +lvalue.@refill + +@item Number +A numeric valued data object. The @code{gawk} implementation uses double +precision floating point to represent numbers.@refill + +@item Pattern +Patterns tell @code{awk} which input records are interesting to which +rules. + +A pattern is an arbitrary conditional expression against which input is +tested. If the condition is satisfied, the pattern is said to @dfn{match} +the input record. A typical pattern might compare the input record against +a regular expression. @xref{Patterns}.@refill + +@item @sc{posix} +The name for a series of standards being developed by the @sc{ieee} +that specify a Portable Operating System interface. The ``IX'' denotes +the Unix heritage of these standards. The main standard of interest for +@code{awk} users is P1003.2, the Command Language and Utilities standard. + +@item Range (of input lines) +A sequence of consecutive lines from the input file. A pattern +can specify ranges of input lines for @code{awk} to process, or it can +specify single lines. @xref{Patterns}.@refill + +@item Recursion +When a function calls itself, either directly or indirectly. +If this isn't clear, refer to the entry for ``recursion.'' + +@item Redirection +Redirection means performing input from other than the standard input +stream, or output to other than the standard output stream. + +You can redirect the output of the @code{print} and @code{printf} statements +to a file or a system command, using the @samp{>}, @samp{>>}, and @samp{|} +operators. You can redirect input to the @code{getline} statement using +the @samp{<} and @samp{|} operators. +@xref{Redirection, ,Redirecting Output of @code{print} and @code{printf}}.@refill + +@item Regular Expression +See ``regexp.'' + +@item Regexp +Short for @dfn{regular expression}. A regexp is a pattern that denotes a +set of strings, possibly an infinite set. 
For example, the regexp +@samp{R.*xp} matches any string starting with the letter @samp{R} +and ending with the letters @samp{xp}. In @code{awk}, regexps are +used in patterns and in conditional expressions. Regexps may contain +escape sequences. @xref{Regexp, ,Regular Expressions as Patterns}.@refill + +@item Rule +A segment of an @code{awk} program, that specifies how to process single +input records. A rule consists of a @dfn{pattern} and an @dfn{action}. +@code{awk} reads an input record; then, for each rule, if the input record +satisfies the rule's pattern, @code{awk} executes the rule's action. +Otherwise, the rule does nothing for that input record.@refill + +@item Side Effect +A side effect occurs when an expression has an effect aside from merely +producing a value. Assignment expressions, increment expressions and +function calls have side effects. @xref{Assignment Ops, ,Assignment Expressions}. + +@item Special File +A file name interpreted internally by @code{gawk}, instead of being handed +directly to the underlying operating system. For example, @file{/dev/stdin}. +@xref{Special Files, ,Standard I/O Streams}. + +@item Stream Editor +A program that reads records from an input stream and processes them one +or more at a time. This is in contrast with batch programs, which may +expect to read their input files in entirety before starting to do +anything, and with interactive programs, which require input from the +user.@refill + +@item String +A datum consisting of a sequence of characters, such as @samp{I am a +string}. Constant strings are written with double-quotes in the +@code{awk} language, and may contain escape sequences. +@xref{Constants, ,Constant Expressions}. + +@item Whitespace +A sequence of blank or tab characters occurring inside an input record or a +string.@refill +@end table + +@node Index, , Glossary, Top +@unnumbered Index +@printindex cp + +@summarycontents +@contents +@bye + +Unresolved Issues: +------------------ +1. 
From: ntomczak@vm.ucs.ualberta.ca (Michal Jaegermann) + Examples of usage tend to suggest that /../ and ".." delimiters + can be used for regular expressions, even if definition is consistently + using /../. I am not sure what the real rules are and in particular + what of the following is a bug and what is a feature: + # This program matches everything + '"\(" { print }' + # This one complains about mismatched parenthesis + '$0 ~ "\(" { print }' + # This one behaves in an expected manner + '/\(/ { print }' + You may also try to use "\(" as an argument to match() to see what + will happen. + +2. From ADR. + + The posix (and original Unix!) notion of awk values as both number + and string values needs to be put into the manual. This involves + major and minor rewrites of most of the manual, but should help in + clarifying many of the weirder points of the language. + +3. From ADR. + + The manual should be reorganized. Expressions should be introduced + early, building up to regexps as expressions, and from there to their + use as patterns and then in actions. Built-in vars should come earlier + in the manual too. The 'expert info' sections marked with comments + should get their own sections or subsections with nodes and titles. + The manual should be gone over thoroughly for indexing. + +4. From ADR. + + Robert J. Chassell points out that awk programs should have some indication + of how to use them. It would be useful to perhaps have a "programming + style" section of the manual that would include this and other tips. + +5. From ADR in response to moraes@uunet.ca + (This would make the beginnings of a good "puzzles" section...) 
+ + Date: Mon, 2 Dec 91 10:08:05 EST + From: gatech!cc!arnold (Arnold Robbins) + To: cs.dal.ca!david, uunet.ca!moraes + Subject: redirecting to /dev/stderr + Cc: skeeve!arnold, boeing.com!brennan, research.att.com!bwk + + In 2.13.3 the following program no longer dumps core: + + BEGIN { print "hello" > /dev/stderr ; exit(1) } + + Instead, it creates a file named `0' with the word `hello' in it. AWK + semantics strikes again. The meaning of the statement is + + print "hello" > (($0 ~ /dev/) stderr) + + /dev/ tests $0 for the pattern `dev'. This yields a 0. The variable stderr, + having never been used, has a null string in it. The concatenation yields + a string value of "0" which is used as the file name. Sigh. + + I think with some more time I can come up with a decent fix, but it will + probably only print a diagnostic with -Wlint. + + Arnold + diff --git a/gnu/usr.bin/awk/getopt.c b/gnu/usr.bin/awk/getopt.c new file mode 100644 index 000000000000..bbf345c33ca2 --- /dev/null +++ b/gnu/usr.bin/awk/getopt.c @@ -0,0 +1,662 @@ +/* Getopt for GNU. + NOTE: getopt is now part of the C library, so if you don't know what + "Keep this file name-space clean" means, talk to roland@gnu.ai.mit.edu + before changing it! + + Copyright (C) 1987, 88, 89, 90, 91, 1992 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU Library General Public License as published + by the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU Library General Public + License along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
*/ + +#ifdef GAWK +#include "config.h" +#endif + +#include <stdio.h> + +/* This needs to come after some library #include + to get __GNU_LIBRARY__ defined. */ +#ifdef __GNU_LIBRARY__ +#include <stdlib.h> +#include <string.h> +#endif /* GNU C library. */ + + +#ifndef __STDC__ +#define const +#endif + +/* If GETOPT_COMPAT is defined, `+' as well as `--' can introduce a + long-named option. Because this is not POSIX.2 compliant, it is + being phased out. */ +#define GETOPT_COMPAT + +/* This version of `getopt' appears to the caller like standard Unix `getopt' + but it behaves differently for the user, since it allows the user + to intersperse the options with the other arguments. + + As `getopt' works, it permutes the elements of ARGV so that, + when it is done, all the options precede everything else. Thus + all application programs are extended to handle flexible argument order. + + Setting the environment variable POSIXLY_CORRECT disables permutation. + Then the behavior is completely standard. + + GNU application programs can use a third alternative mode in which + they can distinguish the relative order of options and other arguments. */ + +#include "getopt.h" + +/* For communication from `getopt' to the caller. + When `getopt' finds an option that takes an argument, + the argument value is returned here. + Also, when `ordering' is RETURN_IN_ORDER, + each non-option ARGV-element is returned here. */ + +char *optarg = 0; + +/* Index in ARGV of the next element to be scanned. + This is used for communication to and from the caller + and for communication between successive calls to `getopt'. + + On entry to `getopt', zero means this is the first call; initialize. + + When `getopt' returns EOF, this is the index of the first of the + non-option elements that the caller should itself scan. + + Otherwise, `optind' communicates from one call to the next + how much of ARGV has been scanned so far. 
*/ + +int optind = 0; + +/* The next char to be scanned in the option-element + in which the last option character we returned was found. + This allows us to pick up the scan where we left off. + + If this is zero, or a null string, it means resume the scan + by advancing to the next ARGV-element. */ + +static char *nextchar; + +/* Callers store zero here to inhibit the error message + for unrecognized options. */ + +int opterr = 1; + +/* Describe how to deal with options that follow non-option ARGV-elements. + + If the caller did not specify anything, + the default is REQUIRE_ORDER if the environment variable + POSIXLY_CORRECT is defined, PERMUTE otherwise. + + REQUIRE_ORDER means don't recognize them as options; + stop option processing when the first non-option is seen. + This is what Unix does. + This mode of operation is selected by either setting the environment + variable POSIXLY_CORRECT, or using `+' as the first character + of the list of option characters. + + PERMUTE is the default. We permute the contents of ARGV as we scan, + so that eventually all the non-options are at the end. This allows options + to be given in any order, even with programs that were not written to + expect this. + + RETURN_IN_ORDER is an option available to programs that were written + to expect options and other ARGV-elements in any order and that care about + the ordering of the two. We describe each non-option ARGV-element + as if it were the argument of an option with character code 1. + Using `-' as the first character of the list of option characters + selects this mode of operation. + + The special argument `--' forces an end of option-scanning regardless + of the value of `ordering'. In the case of RETURN_IN_ORDER, only + `--' can cause `getopt' to return EOF with `optind' != ARGC. 
*/ + +static enum +{ + REQUIRE_ORDER, PERMUTE, RETURN_IN_ORDER +} ordering; + +#ifdef __GNU_LIBRARY__ +#include <string.h> +#define my_index strchr +#define my_bcopy(src, dst, n) memcpy ((dst), (src), (n)) +#else + +/* Avoid depending on library functions or files + whose names are inconsistent. */ + +char *getenv (); + +static char * +my_index (string, chr) + char *string; + int chr; +{ + while (*string) + { + if (*string == chr) + return string; + string++; + } + return 0; +} + +static void +my_bcopy (from, to, size) + char *from, *to; + int size; +{ + int i; + for (i = 0; i < size; i++) + to[i] = from[i]; +} +#endif /* GNU C library. */ + +/* Handle permutation of arguments. */ + +/* Describe the part of ARGV that contains non-options that have + been skipped. `first_nonopt' is the index in ARGV of the first of them; + `last_nonopt' is the index after the last of them. */ + +static int first_nonopt; +static int last_nonopt; + +/* Exchange two adjacent subsequences of ARGV. + One subsequence is elements [first_nonopt,last_nonopt) + which contains all the non-options that have been skipped so far. + The other is elements [last_nonopt,optind), which contains all + the options processed since those non-options were skipped. + + `first_nonopt' and `last_nonopt' are relocated so that they describe + the new indices of the non-options in ARGV after they are moved. */ + +static void +exchange (argv) + char **argv; +{ + int nonopts_size = (last_nonopt - first_nonopt) * sizeof (char *); + char **temp = (char **) malloc (nonopts_size); + + /* Interchange the two blocks of data in ARGV. */ + + my_bcopy (&argv[first_nonopt], temp, nonopts_size); + my_bcopy (&argv[last_nonopt], &argv[first_nonopt], + (optind - last_nonopt) * sizeof (char *)); + my_bcopy (temp, &argv[first_nonopt + optind - last_nonopt], nonopts_size); + + free(temp); + + /* Update records for the slots the non-options now occupy. 
*/ + + first_nonopt += (optind - last_nonopt); + last_nonopt = optind; +} + +/* Scan elements of ARGV (whose length is ARGC) for option characters + given in OPTSTRING. + + If an element of ARGV starts with '-', and is not exactly "-" or "--", + then it is an option element. The characters of this element + (aside from the initial '-') are option characters. If `getopt' + is called repeatedly, it returns successively each of the option characters + from each of the option elements. + + If `getopt' finds another option character, it returns that character, + updating `optind' and `nextchar' so that the next call to `getopt' can + resume the scan with the following option character or ARGV-element. + + If there are no more option characters, `getopt' returns `EOF'. + Then `optind' is the index in ARGV of the first ARGV-element + that is not an option. (The ARGV-elements have been permuted + so that those that are not options now come last.) + + OPTSTRING is a string containing the legitimate option characters. + If an option character is seen that is not listed in OPTSTRING, + return '?' after printing an error message. If you set `opterr' to + zero, the error message is suppressed but we still return '?'. + + If a char in OPTSTRING is followed by a colon, that means it wants an arg, + so the following text in the same ARGV-element, or the text of the following + ARGV-element, is returned in `optarg'. Two colons mean an option that + wants an optional arg; if there is text in the current ARGV-element, + it is returned in `optarg', otherwise `optarg' is set to zero. + + If OPTSTRING starts with `-' or `+', it requests different methods of + handling the non-option ARGV-elements. + See the comments about RETURN_IN_ORDER and REQUIRE_ORDER, above. + + Long-named options begin with `--' instead of `-'. + Their names may be abbreviated as long as the abbreviation is unique + or is an exact match for some defined option. 
If they have an + argument, it follows the option name in the same ARGV-element, separated + from the option name by a `=', or else in the next ARGV-element. + When `getopt' finds a long-named option, it returns 0 if that option's + `flag' field is nonzero, and the value of the option's `val' field + if the `flag' field is zero. + + The elements of ARGV aren't really const, because we permute them. + But we pretend they're const in the prototype to be compatible + with other systems. + + LONGOPTS is a vector of `struct option' terminated by an + element containing a name which is zero. + + LONGIND returns the index in LONGOPTS of the long-named option found. + It is only valid when a long-named option has been found by the most + recent call. + + If LONG_ONLY is nonzero, '-' as well as '--' can introduce + long-named options. */ + +int +_getopt_internal (argc, argv, optstring, longopts, longind, long_only) + int argc; + char *const *argv; + const char *optstring; + const struct option *longopts; + int *longind; + int long_only; +{ + int option_index; + + optarg = 0; + + /* Initialize the internal data when the first call is made. + Start processing options with ARGV-element 1 (since ARGV-element 0 + is the program name); the sequence of previously skipped + non-option ARGV-elements is empty. */ + + if (optind == 0) + { + first_nonopt = last_nonopt = optind = 1; + + nextchar = NULL; + + /* Determine how to handle the ordering of options and nonoptions. */ + + if (optstring[0] == '-') + { + ordering = RETURN_IN_ORDER; + ++optstring; + } + else if (optstring[0] == '+') + { + ordering = REQUIRE_ORDER; + ++optstring; + } + else if (getenv ("POSIXLY_CORRECT") != NULL) + ordering = REQUIRE_ORDER; + else + ordering = PERMUTE; + } + + if (nextchar == NULL || *nextchar == '\0') + { + if (ordering == PERMUTE) + { + /* If we have just processed some options following some non-options, + exchange them so that the options come first. 
*/ + + if (first_nonopt != last_nonopt && last_nonopt != optind) + exchange ((char **) argv); + else if (last_nonopt != optind) + first_nonopt = optind; + + /* Now skip any additional non-options + and extend the range of non-options previously skipped. */ + + while (optind < argc + && (argv[optind][0] != '-' || argv[optind][1] == '\0') +#ifdef GETOPT_COMPAT + && (longopts == NULL + || argv[optind][0] != '+' || argv[optind][1] == '\0') +#endif /* GETOPT_COMPAT */ + ) + optind++; + last_nonopt = optind; + } + + /* Special ARGV-element `--' means premature end of options. + Skip it like a null option, + then exchange with previous non-options as if it were an option, + then skip everything else like a non-option. */ + + if (optind != argc && !strcmp (argv[optind], "--")) + { + optind++; + + if (first_nonopt != last_nonopt && last_nonopt != optind) + exchange ((char **) argv); + else if (first_nonopt == last_nonopt) + first_nonopt = optind; + last_nonopt = argc; + + optind = argc; + } + + /* If we have done all the ARGV-elements, stop the scan + and back over any non-options that we skipped and permuted. */ + + if (optind == argc) + { + /* Set the next-arg-index to point at the non-options + that we previously skipped, so the caller will digest them. */ + if (first_nonopt != last_nonopt) + optind = first_nonopt; + return EOF; + } + + /* If we have come to a non-option and did not permute it, + either stop the scan or describe it to the caller and pass it by. */ + + if ((argv[optind][0] != '-' || argv[optind][1] == '\0') +#ifdef GETOPT_COMPAT + && (longopts == NULL + || argv[optind][0] != '+' || argv[optind][1] == '\0') +#endif /* GETOPT_COMPAT */ + ) + { + if (ordering == REQUIRE_ORDER) + return EOF; + optarg = argv[optind++]; + return 1; + } + + /* We have found another option-ARGV-element. + Start decoding its characters. 
*/ + + nextchar = (argv[optind] + 1 + + (longopts != NULL && argv[optind][1] == '-')); + } + + if (longopts != NULL + && ((argv[optind][0] == '-' + && (argv[optind][1] == '-' || long_only)) +#ifdef GETOPT_COMPAT + || argv[optind][0] == '+' +#endif /* GETOPT_COMPAT */ + )) + { + const struct option *p; + char *s = nextchar; + int exact = 0; + int ambig = 0; + const struct option *pfound = NULL; + int indfound = 0; + extern int strncmp(); + + while (*s && *s != '=') + s++; + + /* Test all options for either exact match or abbreviated matches. */ + for (p = longopts, option_index = 0; p->name; + p++, option_index++) + if (!strncmp (p->name, nextchar, s - nextchar)) + { + if (s - nextchar == strlen (p->name)) + { + /* Exact match found. */ + pfound = p; + indfound = option_index; + exact = 1; + break; + } + else if (pfound == NULL) + { + /* First nonexact match found. */ + pfound = p; + indfound = option_index; + } + else + /* Second nonexact match found. */ + ambig = 1; + } + + if (ambig && !exact) + { + if (opterr) + fprintf (stderr, "%s: option `%s' is ambiguous\n", + argv[0], argv[optind]); + nextchar += strlen (nextchar); + optind++; + return '?'; + } + + if (pfound != NULL) + { + option_index = indfound; + optind++; + if (*s) + { + /* Don't test has_arg with >, because some C compilers don't + allow it to be used on enums. 
*/ + if (pfound->has_arg) + optarg = s + 1; + else + { + if (opterr) + { + if (argv[optind - 1][1] == '-') + /* --option */ + fprintf (stderr, + "%s: option `--%s' doesn't allow an argument\n", + argv[0], pfound->name); + else + /* +option or -option */ + fprintf (stderr, + "%s: option `%c%s' doesn't allow an argument\n", + argv[0], argv[optind - 1][0], pfound->name); + } + nextchar += strlen (nextchar); + return '?'; + } + } + else if (pfound->has_arg == 1) + { + if (optind < argc) + optarg = argv[optind++]; + else + { + if (opterr) + fprintf (stderr, "%s: option `%s' requires an argument\n", + argv[0], argv[optind - 1]); + nextchar += strlen (nextchar); + return '?'; + } + } + nextchar += strlen (nextchar); + if (longind != NULL) + *longind = option_index; + if (pfound->flag) + { + *(pfound->flag) = pfound->val; + return 0; + } + return pfound->val; + } + /* Can't find it as a long option. If this is not getopt_long_only, + or the option starts with '--' or is not a valid short + option, then it's an error. + Otherwise interpret it as a short option. */ + if (!long_only || argv[optind][1] == '-' +#ifdef GETOPT_COMPAT + || argv[optind][0] == '+' +#endif /* GETOPT_COMPAT */ + || my_index (optstring, *nextchar) == NULL) + { + if (opterr) + { + if (argv[optind][1] == '-') + /* --option */ + fprintf (stderr, "%s: unrecognized option `--%s'\n", + argv[0], nextchar); + else + /* +option or -option */ + fprintf (stderr, "%s: unrecognized option `%c%s'\n", + argv[0], argv[optind][0], nextchar); + } + nextchar = (char *) ""; + optind++; + return '?'; + } + } + + /* Look at and handle the next option-character. */ + + { + char c = *nextchar++; + char *temp = my_index (optstring, c); + + /* Increment `optind' when we start to process its last character. 
*/ + if (*nextchar == '\0') + ++optind; + + if (temp == NULL || c == ':') + { + if (opterr) + { + if (c < 040 || c >= 0177) + fprintf (stderr, "%s: unrecognized option, character code 0%o\n", + argv[0], c); + else + fprintf (stderr, "%s: unrecognized option `-%c'\n", argv[0], c); + } + return '?'; + } + if (temp[1] == ':') + { + if (temp[2] == ':') + { + /* This is an option that accepts an argument optionally. */ + if (*nextchar != '\0') + { + optarg = nextchar; + optind++; + } + else + optarg = 0; + nextchar = NULL; + } + else + { + /* This is an option that requires an argument. */ + if (*nextchar != '\0') + { + optarg = nextchar; + /* If we end this ARGV-element by taking the rest as an arg, + we must advance to the next element now. */ + optind++; + } + else if (optind == argc) + { + if (opterr) + fprintf (stderr, "%s: option `-%c' requires an argument\n", + argv[0], c); + c = '?'; + } + else + /* We already incremented `optind' once; + increment it again when taking next ARGV-elt as argument. */ + optarg = argv[optind++]; + nextchar = NULL; + } + } + return c; + } +} + +int +getopt (argc, argv, optstring) + int argc; + char *const *argv; + const char *optstring; +{ + return _getopt_internal (argc, argv, optstring, + (const struct option *) 0, + (int *) 0, + 0); +} + +#ifdef TEST + +/* Compile with -DTEST to make an executable for use in testing + the above definition of `getopt'. */ + +int +main (argc, argv) + int argc; + char **argv; +{ + int c; + int digit_optind = 0; + + while (1) + { + int this_option_optind = optind ? 
optind : 1; + + c = getopt (argc, argv, "abc:d:0123456789"); + if (c == EOF) + break; + + switch (c) + { + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + if (digit_optind != 0 && digit_optind != this_option_optind) + printf ("digits occur in two different argv-elements.\n"); + digit_optind = this_option_optind; + printf ("option %c\n", c); + break; + + case 'a': + printf ("option a\n"); + break; + + case 'b': + printf ("option b\n"); + break; + + case 'c': + printf ("option c with value `%s'\n", optarg); + break; + + case '?': + break; + + default: + printf ("?? getopt returned character code 0%o ??\n", c); + } + } + + if (optind < argc) + { + printf ("non-option ARGV-elements: "); + while (optind < argc) + printf ("%s ", argv[optind++]); + printf ("\n"); + } + + exit (0); +} + +#endif /* TEST */ diff --git a/gnu/usr.bin/awk/getopt.h b/gnu/usr.bin/awk/getopt.h new file mode 100644 index 000000000000..de027434f7cb --- /dev/null +++ b/gnu/usr.bin/awk/getopt.h @@ -0,0 +1,128 @@ +/* Declarations for getopt. + Copyright (C) 1989, 1990, 1991, 1992 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify it + under the terms of the GNU Library General Public License as published + by the Free Software Foundation; either version 2, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU Library General Public + License along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + +#ifndef _GETOPT_H +#define _GETOPT_H 1 + +#ifdef __cplusplus +extern "C" { +#endif + +/* For communication from `getopt' to the caller. 
+ When `getopt' finds an option that takes an argument, + the argument value is returned here. + Also, when `ordering' is RETURN_IN_ORDER, + each non-option ARGV-element is returned here. */ + +extern char *optarg; + +/* Index in ARGV of the next element to be scanned. + This is used for communication to and from the caller + and for communication between successive calls to `getopt'. + + On entry to `getopt', zero means this is the first call; initialize. + + When `getopt' returns EOF, this is the index of the first of the + non-option elements that the caller should itself scan. + + Otherwise, `optind' communicates from one call to the next + how much of ARGV has been scanned so far. */ + +extern int optind; + +/* Callers store zero here to inhibit the error message `getopt' prints + for unrecognized options. */ + +extern int opterr; + +/* Describe the long-named options requested by the application. + The LONG_OPTIONS argument to getopt_long or getopt_long_only is a vector + of `struct option' terminated by an element containing a name which is + zero. + + The field `has_arg' is: + no_argument (or 0) if the option does not take an argument, + required_argument (or 1) if the option requires an argument, + optional_argument (or 2) if the option takes an optional argument. + + If the field `flag' is not NULL, it points to a variable that is set + to the value given in the field `val' when the option is found, but + left unchanged if the option is not found. + + To have a long-named option do something other than set an `int' to + a compiled-in constant, such as set a value from `optarg', set the + option's `flag' field to zero and its `val' field to a nonzero + value (the equivalent single-letter option character, if there is + one). For long options that have a zero `flag' field, `getopt' + returns the contents of the `val' field. 
*/ + +struct option +{ +#if __STDC__ + const char *name; +#else + char *name; +#endif + /* has_arg can't be an enum because some compilers complain about + type mismatches in all the code that assumes it is an int. */ + int has_arg; + int *flag; + int val; +}; + +/* Names for the values of the `has_arg' field of `struct option'. */ + +enum _argtype +{ + no_argument, + required_argument, + optional_argument +}; + +#if __STDC__ +#if defined(__GNU_LIBRARY__) +/* Many other libraries have conflicting prototypes for getopt, with + differences in the consts, in stdlib.h. To avoid compilation + errors, only prototype getopt for the GNU C library. */ +extern int getopt (int argc, char *const *argv, const char *shortopts); +#else /* not __GNU_LIBRARY__ */ +extern int getopt (); +#endif /* not __GNU_LIBRARY__ */ +extern int getopt_long (int argc, char *const *argv, const char *shortopts, + const struct option *longopts, int *longind); +extern int getopt_long_only (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int *longind); + +/* Internal only. Users should not call this directly. */ +extern int _getopt_internal (int argc, char *const *argv, + const char *shortopts, + const struct option *longopts, int *longind, + int long_only); +#else /* not __STDC__ */ +extern int getopt (); +extern int getopt_long (); +extern int getopt_long_only (); + +extern int _getopt_internal (); +#endif /* not __STDC__ */ + +#ifdef __cplusplus +} +#endif + +#endif /* _GETOPT_H */ diff --git a/gnu/usr.bin/awk/getopt1.c b/gnu/usr.bin/awk/getopt1.c new file mode 100644 index 000000000000..e2127cd58d42 --- /dev/null +++ b/gnu/usr.bin/awk/getopt1.c @@ -0,0 +1,160 @@ +/* Getopt for GNU. + Copyright (C) 1987, 88, 89, 90, 91, 1992 Free Software Foundation, Inc. + +This file is part of the libiberty library. 
+Libiberty is free software; you can redistribute it and/or +modify it under the terms of the GNU Library General Public +License as published by the Free Software Foundation; either +version 2 of the License, or (at your option) any later version. + +Libiberty is distributed in the hope that it will be useful, +but WITHOUT ANY WARRANTY; without even the implied warranty of +MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU +Library General Public License for more details. + +You should have received a copy of the GNU Library General Public +License along with libiberty; see the file COPYING.LIB. If +not, write to the Free Software Foundation, Inc., 675 Mass Ave, +Cambridge, MA 02139, USA. */ + +#ifdef LIBC +/* For when compiled as part of the GNU C library. */ +#include <ansidecl.h> +#endif + +#include "getopt.h" + +#ifndef __STDC__ +#define const +#endif + +#if defined(STDC_HEADERS) || defined(__GNU_LIBRARY__) || defined (LIBC) +#include <stdlib.h> +#else /* STDC_HEADERS or __GNU_LIBRARY__ */ +char *getenv (); +#endif /* STDC_HEADERS or __GNU_LIBRARY__ */ + +#if !defined (NULL) +#define NULL 0 +#endif + +int +getopt_long (argc, argv, options, long_options, opt_index) + int argc; + char *const *argv; + const char *options; + const struct option *long_options; + int *opt_index; +{ + return _getopt_internal (argc, argv, options, long_options, opt_index, 0); +} + +/* Like getopt_long, but '-' as well as '--' can indicate a long option. + If an option that starts with '-' (not '--') doesn't match a long option, + but does match a short option, it is parsed as a short option + instead. 
*/ + +int +getopt_long_only (argc, argv, options, long_options, opt_index) + int argc; + char *const *argv; + const char *options; + const struct option *long_options; + int *opt_index; +{ + return _getopt_internal (argc, argv, options, long_options, opt_index, 1); +} + +#ifdef TEST + +#include <stdio.h> + +int +main (argc, argv) + int argc; + char **argv; +{ + int c; + int digit_optind = 0; + + while (1) + { + int this_option_optind = optind ? optind : 1; + int option_index = 0; + static struct option long_options[] = + { + {"add", 1, 0, 0}, + {"append", 0, 0, 0}, + {"delete", 1, 0, 0}, + {"verbose", 0, 0, 0}, + {"create", 0, 0, 0}, + {"file", 1, 0, 0}, + {0, 0, 0, 0} + }; + + c = getopt_long (argc, argv, "abc:d:0123456789", + long_options, &option_index); + if (c == EOF) + break; + + switch (c) + { + case 0: + printf ("option %s", long_options[option_index].name); + if (optarg) + printf (" with arg %s", optarg); + printf ("\n"); + break; + + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + if (digit_optind != 0 && digit_optind != this_option_optind) + printf ("digits occur in two different argv-elements.\n"); + digit_optind = this_option_optind; + printf ("option %c\n", c); + break; + + case 'a': + printf ("option a\n"); + break; + + case 'b': + printf ("option b\n"); + break; + + case 'c': + printf ("option c with value `%s'\n", optarg); + break; + + case 'd': + printf ("option d with value `%s'\n", optarg); + break; + + case '?': + break; + + default: + printf ("?? 
getopt returned character code 0%o ??\n", c); + } + } + + if (optind < argc) + { + printf ("non-option ARGV-elements: "); + while (optind < argc) + printf ("%s ", argv[optind++]); + printf ("\n"); + } + + exit (0); +} + +#endif /* TEST */ diff --git a/gnu/usr.bin/awk/io.c b/gnu/usr.bin/awk/io.c new file mode 100644 index 000000000000..7004aedd519d --- /dev/null +++ b/gnu/usr.bin/awk/io.c @@ -0,0 +1,1207 @@ +/* + * io.c --- routines for dealing with input and output and records + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Programming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#include "awk.h" + +#ifndef O_RDONLY +#include <fcntl.h> +#endif + +#if !defined(S_ISDIR) && defined(S_IFDIR) +#define S_ISDIR(m) (((m) & S_IFMT) == S_IFDIR) +#endif + +#ifndef atarist +#define INVALID_HANDLE (-1) +#else +#define INVALID_HANDLE (__SMALLEST_VALID_HANDLE - 1) +#endif + +#if defined(MSDOS) || defined(atarist) +#define PIPES_SIMULATED +#endif + +static IOBUF *nextfile P((int skipping)); +static int inrec P((IOBUF *iop)); +static int iop_close P((IOBUF *iop)); +struct redirect *redirect P((NODE *tree, int *errflg)); +static void close_one P((void)); +static int close_redir P((struct redirect *rp)); +#ifndef PIPES_SIMULATED +static int wait_any P((int interesting)); +#endif +static IOBUF *gawk_popen P((char *cmd, struct redirect *rp)); +static IOBUF *iop_open P((char *file, char *how)); +static int gawk_pclose P((struct redirect *rp)); +static int do_pathopen P((char *file)); + +extern FILE *fdopen(); +extern FILE *popen(); + +static struct redirect *red_head = NULL; + +extern int output_is_tty; +extern NODE *ARGC_node; +extern NODE *ARGV_node; +extern NODE *ARGIND_node; +extern NODE *ERRNO_node; +extern NODE **fields_arr; + +static jmp_buf filebuf; /* for do_nextfile() */ + +/* do_nextfile --- implement gawk "next file" extension */ + +void +do_nextfile() +{ + (void) nextfile(1); + longjmp(filebuf, 1); +} + +static IOBUF * +nextfile(skipping) +int skipping; +{ + static int i = 1; + static int files = 0; + NODE *arg; + int fd = INVALID_HANDLE; + static IOBUF *curfile = NULL; + + if (skipping) { + if (curfile != NULL) + iop_close(curfile); + curfile = NULL; + return NULL; + } + if (curfile != NULL) { + if (curfile->cnt == EOF) { + (void) iop_close(curfile); + curfile = NULL; + } else + return curfile; + } + for (; i < (int) (ARGC_node->lnode->numbr); i++) { + arg = *assoc_lookup(ARGV_node, tmp_number((AWKNUM) i)); + if (arg->stptr[0] == '\0') + continue; + arg->stptr[arg->stlen] = '\0'; + if (! 
do_unix) { + ARGIND_node->var_value->numbr = i; + ARGIND_node->var_value->flags = NUM|NUMBER; + } + if (!arg_assign(arg->stptr)) { + files++; + curfile = iop_open(arg->stptr, "r"); + if (curfile == NULL) + fatal("cannot open file `%s' for reading (%s)", + arg->stptr, strerror(errno)); + /* NOTREACHED */ + /* This is a kludge. */ + unref(FILENAME_node->var_value); + FILENAME_node->var_value = + dupnode(arg); + FNR = 0; + i++; + break; + } + } + if (files == 0) { + files++; + /* no args. -- use stdin */ + /* FILENAME is init'ed to "-" */ + /* FNR is init'ed to 0 */ + curfile = iop_alloc(fileno(stdin)); + } + return curfile; +} + +void +set_FNR() +{ + FNR = (int) FNR_node->var_value->numbr; +} + +void +set_NR() +{ + NR = (int) NR_node->var_value->numbr; +} + +/* + * This reads in a record from the input file + */ +static int +inrec(iop) +IOBUF *iop; +{ + char *begin; + register int cnt; + int retval = 0; + + cnt = get_a_record(&begin, iop, *RS, NULL); + if (cnt == EOF) { + cnt = 0; + retval = 1; + } else { + NR += 1; + FNR += 1; + } + set_record(begin, cnt, 1); + + return retval; +} + +static int +iop_close(iop) +IOBUF *iop; +{ + int ret; + + if (iop == NULL) + return 0; + errno = 0; + +#ifdef _CRAY + /* Work around bug in UNICOS popen */ + if (iop->fd < 3) + ret = 0; + else +#endif + /* save these for re-use; don't free the storage */ + if ((iop->flag & IOP_IS_INTERNAL) != 0) { + iop->off = iop->buf; + iop->end = iop->buf + strlen(iop->buf); + iop->cnt = 0; + iop->secsiz = 0; + return 0; + } + + /* Don't close standard files or else crufty code elsewhere will lose */ + if (iop->fd == fileno(stdin) || + iop->fd == fileno(stdout) || + iop->fd == fileno(stderr)) + ret = 0; + else + ret = close(iop->fd); + if (ret == -1) + warning("close of fd %d failed (%s)", iop->fd, strerror(errno)); + if ((iop->flag & IOP_NO_FREE) == 0) { + /* + * be careful -- $0 may still reference the buffer even though + * an explicit close is being done; in the future, maybe we + * can do this a 
bit better + */ + if (iop->buf) { + if ((fields_arr[0]->stptr >= iop->buf) + && (fields_arr[0]->stptr < iop->end)) { + NODE *t; + + t = make_string(fields_arr[0]->stptr, + fields_arr[0]->stlen); + unref(fields_arr[0]); + fields_arr [0] = t; + reset_record (); + } + free(iop->buf); + } + free((char *)iop); + } + return ret == -1 ? 1 : 0; +} + +void +do_input() +{ + IOBUF *iop; + extern int exiting; + + if (setjmp(filebuf) != 0) { + } + while ((iop = nextfile(0)) != NULL) { + if (inrec(iop) == 0) + while (interpret(expression_value) && inrec(iop) == 0) + ; + if (exiting) + break; + } +} + +/* Redirection for printf and print commands */ +struct redirect * +redirect(tree, errflg) +NODE *tree; +int *errflg; +{ + register NODE *tmp; + register struct redirect *rp; + register char *str; + int tflag = 0; + int outflag = 0; + char *direction = "to"; + char *mode; + int fd; + char *what = NULL; + + switch (tree->type) { + case Node_redirect_append: + tflag = RED_APPEND; + /* FALL THROUGH */ + case Node_redirect_output: + outflag = (RED_FILE|RED_WRITE); + tflag |= outflag; + if (tree->type == Node_redirect_output) + what = ">"; + else + what = ">>"; + break; + case Node_redirect_pipe: + tflag = (RED_PIPE|RED_WRITE); + what = "|"; + break; + case Node_redirect_pipein: + tflag = (RED_PIPE|RED_READ); + what = "|"; + break; + case Node_redirect_input: + tflag = (RED_FILE|RED_READ); + what = "<"; + break; + default: + fatal ("invalid tree type %d in redirect()", tree->type); + break; + } + tmp = tree_eval(tree->subnode); + if (do_lint && ! 
(tmp->flags & STR)) + warning("expression in `%s' redirection only has numeric value", + what); + tmp = force_string(tmp); + str = tmp->stptr; + if (str == NULL || *str == '\0') + fatal("expression for `%s' redirection has null string value", + what); + if (do_lint + && (STREQN(str, "0", tmp->stlen) || STREQN(str, "1", tmp->stlen))) + warning("filename `%s' for `%s' redirection may be result of logical expression", str, what); + for (rp = red_head; rp != NULL; rp = rp->next) + if (strlen(rp->value) == tmp->stlen + && STREQN(rp->value, str, tmp->stlen) + && ((rp->flag & ~(RED_NOBUF|RED_EOF)) == tflag + || (outflag + && (rp->flag & (RED_FILE|RED_WRITE)) == outflag))) + break; + if (rp == NULL) { + emalloc(rp, struct redirect *, sizeof(struct redirect), + "redirect"); + emalloc(str, char *, tmp->stlen+1, "redirect"); + memcpy(str, tmp->stptr, tmp->stlen); + str[tmp->stlen] = '\0'; + rp->value = str; + rp->flag = tflag; + rp->fp = NULL; + rp->iop = NULL; + rp->pid = 0; /* unlikely that we're worried about init */ + rp->status = 0; + /* maintain list in most-recently-used first order */ + if (red_head) + red_head->prev = rp; + rp->prev = NULL; + rp->next = red_head; + red_head = rp; + } + while (rp->fp == NULL && rp->iop == NULL) { + if (rp->flag & RED_EOF) + /* encountered EOF on file or pipe -- must be cleared + * by explicit close() before reading more + */ + return rp; + mode = NULL; + errno = 0; + switch (tree->type) { + case Node_redirect_output: + mode = "w"; + if (rp->flag & RED_USED) + mode = "a"; + break; + case Node_redirect_append: + mode = "a"; + break; + case Node_redirect_pipe: + if ((rp->fp = popen(str, "w")) == NULL) + fatal("can't open pipe (\"%s\") for output (%s)", + str, strerror(errno)); + rp->flag |= RED_NOBUF; + break; + case Node_redirect_pipein: + direction = "from"; + if (gawk_popen(str, rp) == NULL) + fatal("can't open pipe (\"%s\") for input (%s)", + str, strerror(errno)); + break; + case Node_redirect_input: + direction = "from"; + rp->iop 
= iop_open(str, "r"); + break; + default: + cant_happen(); + } + if (mode != NULL) { + fd = devopen(str, mode); + if (fd > INVALID_HANDLE) { + if (fd == fileno(stdin)) + rp->fp = stdin; + else if (fd == fileno(stdout)) + rp->fp = stdout; + else if (fd == fileno(stderr)) + rp->fp = stderr; + else + rp->fp = fdopen(fd, mode); + if (isatty(fd)) + rp->flag |= RED_NOBUF; + } + } + if (rp->fp == NULL && rp->iop == NULL) { + /* too many files open -- close one and try again */ + if (errno == EMFILE) + close_one(); + else { + /* + * Some other reason for failure. + * + * On redirection of input from a file, + * just return an error, so e.g. getline + * can return -1. For output to file, + * complain. The shell will complain on + * a bad command to a pipe. + */ + *errflg = errno; + if (tree->type == Node_redirect_output + || tree->type == Node_redirect_append) + fatal("can't redirect %s `%s' (%s)", + direction, str, strerror(errno)); + else { + free_temp(tmp); + return NULL; + } + } + } + } + free_temp(tmp); + return rp; +} + +static void +close_one() +{ + register struct redirect *rp; + register struct redirect *rplast = NULL; + + /* go to end of list first, to pick up least recently used entry */ + for (rp = red_head; rp != NULL; rp = rp->next) + rplast = rp; + /* now work back up through the list */ + for (rp = rplast; rp != NULL; rp = rp->prev) + if (rp->fp && (rp->flag & RED_FILE)) { + rp->flag |= RED_USED; + errno = 0; + if (fclose(rp->fp)) + warning("close of \"%s\" failed (%s).", + rp->value, strerror(errno)); + rp->fp = NULL; + break; + } + if (rp == NULL) + /* surely this is the only reason ??? 
*/ + fatal("too many pipes or input files open"); +} + +NODE * +do_close(tree) +NODE *tree; +{ + NODE *tmp; + register struct redirect *rp; + + tmp = force_string(tree_eval(tree->subnode)); + for (rp = red_head; rp != NULL; rp = rp->next) { + if (strlen(rp->value) == tmp->stlen + && STREQN(rp->value, tmp->stptr, tmp->stlen)) + break; + } + free_temp(tmp); + if (rp == NULL) /* no match */ + return tmp_number((AWKNUM) 0.0); + fflush(stdout); /* synchronize regular output */ + tmp = tmp_number((AWKNUM)close_redir(rp)); + rp = NULL; + return tmp; +} + +static int +close_redir(rp) +register struct redirect *rp; +{ + int status = 0; + + if (rp == NULL) + return 0; + if (rp->fp == stdout || rp->fp == stderr) + return 0; + errno = 0; + if ((rp->flag & (RED_PIPE|RED_WRITE)) == (RED_PIPE|RED_WRITE)) + status = pclose(rp->fp); + else if (rp->fp) + status = fclose(rp->fp); + else if (rp->iop) { + if (rp->flag & RED_PIPE) + status = gawk_pclose(rp); + else { + status = iop_close(rp->iop); + rp->iop = NULL; + } + } + /* SVR4 awk checks and warns about status of close */ + if (status) { + char *s = strerror(errno); + + warning("failure status (%d) on %s close of \"%s\" (%s).", + status, + (rp->flag & RED_PIPE) ? "pipe" : + "file", rp->value, s); + + if (! 
do_unix) { + /* set ERRNO too so that program can get at it */ + unref(ERRNO_node->var_value); + ERRNO_node->var_value = make_string(s, strlen(s)); + } + } + if (rp->next) + rp->next->prev = rp->prev; + if (rp->prev) + rp->prev->next = rp->next; + else + red_head = rp->next; + free(rp->value); + free((char *)rp); + return status; +} + +int +flush_io () +{ + register struct redirect *rp; + int status = 0; + + errno = 0; + if (fflush(stdout)) { + warning("error writing standard output (%s).", strerror(errno)); + status++; + } + if (fflush(stderr)) { + warning("error writing standard error (%s).", strerror(errno)); + status++; + } + for (rp = red_head; rp != NULL; rp = rp->next) + /* flush both files and pipes, what the heck */ + if ((rp->flag & RED_WRITE) && rp->fp != NULL) { + if (fflush(rp->fp)) { + warning("%s flush of \"%s\" failed (%s).", + (rp->flag & RED_PIPE) ? "pipe" : + "file", rp->value, strerror(errno)); + status++; + } + } + return status; +} + +int +close_io () +{ + register struct redirect *rp; + register struct redirect *next; + int status = 0; + + errno = 0; + if (fclose(stdout)) { + warning("error writing standard output (%s).", strerror(errno)); + status++; + } + if (fclose(stderr)) { + warning("error writing standard error (%s).", strerror(errno)); + status++; + } + for (rp = red_head; rp != NULL; rp = next) { + next = rp->next; + if (close_redir(rp)) + status++; + rp = NULL; + } + return status; +} + +/* str2mode --- convert a string mode to an integer mode */ + +static int +str2mode(mode) +char *mode; +{ + int ret; + + switch(mode[0]) { + case 'r': + ret = O_RDONLY; + break; + + case 'w': + ret = O_WRONLY|O_CREAT|O_TRUNC; + break; + + case 'a': + ret = O_WRONLY|O_APPEND|O_CREAT; + break; + default: + cant_happen(); + } + return ret; +} + +/* devopen --- handle /dev/std{in,out,err}, /dev/fd/N, regular files */ + +/* + * This separate version is still needed for output, since file and pipe + * output is done with stdio. 
iop_open() handles input with IOBUFs of + * more "special" files. Those files are not handled here since it makes + * no sense to use them for output. + */ + +int +devopen(name, mode) +char *name, *mode; +{ + int openfd = INVALID_HANDLE; + char *cp, *ptr; + int flag = 0; + struct stat buf; + extern double strtod(); + + flag = str2mode(mode); + + if (do_unix) + goto strictopen; + +#ifdef VMS + if ((openfd = vms_devopen(name, flag)) >= 0) + return openfd; +#endif /* VMS */ + + if (STREQ(name, "-")) + openfd = fileno(stdin); + else if (STREQN(name, "/dev/", 5) && stat(name, &buf) == -1) { + cp = name + 5; + + if (STREQ(cp, "stdin") && (flag & O_RDONLY) == O_RDONLY) + openfd = fileno(stdin); + else if (STREQ(cp, "stdout") && (flag & O_WRONLY) == O_WRONLY) + openfd = fileno(stdout); + else if (STREQ(cp, "stderr") && (flag & O_WRONLY) == O_WRONLY) + openfd = fileno(stderr); + else if (STREQN(cp, "fd/", 3)) { + cp += 3; + openfd = (int)strtod(cp, &ptr); + if (openfd <= INVALID_HANDLE || ptr == cp) + openfd = INVALID_HANDLE; + } + } + +strictopen: + if (openfd == INVALID_HANDLE) + openfd = open(name, flag, 0666); + if (openfd != INVALID_HANDLE && fstat(openfd, &buf) > 0) + if (S_ISDIR(buf.st_mode)) + fatal("file `%s' is a directory", name); + return openfd; +} + + +/* spec_setup --- setup an IOBUF for a special internal file */ + +void +spec_setup(iop, len, allocate) +IOBUF *iop; +int len; +int allocate; +{ + char *cp; + + if (allocate) { + emalloc(cp, char *, len+2, "spec_setup"); + iop->buf = cp; + } else { + len = strlen(iop->buf); + iop->buf[len++] = '\n'; /* get_a_record clobbered it */ + iop->buf[len] = '\0'; /* just in case */ + } + iop->off = iop->buf; + iop->cnt = 0; + iop->secsiz = 0; + iop->size = len; + iop->end = iop->buf + len; + iop->fd = -1; + iop->flag = IOP_IS_INTERNAL; +} + +/* specfdopen --- open a fd special file */ + +int +specfdopen(iop, name, mode) +IOBUF *iop; +char *name, *mode; +{ + int fd; + IOBUF *tp; + + fd = devopen(name, mode); + if (fd == 
INVALID_HANDLE) + return INVALID_HANDLE; + tp = iop_alloc(fd); + if (tp == NULL) + return INVALID_HANDLE; + *iop = *tp; + iop->flag |= IOP_NO_FREE; + free(tp); + return 0; +} + +/* pidopen --- "open" /dev/pid, /dev/ppid, and /dev/pgrpid */ + +int +pidopen(iop, name, mode) +IOBUF *iop; +char *name, *mode; +{ + char tbuf[BUFSIZ]; + int i; + + if (name[6] == 'g') +/* following #if will improve in 2.16 */ +#if defined(__svr4__) || defined(i860) || defined(_AIX) || defined(BSD4_4) || defined(__386BSD__) + sprintf(tbuf, "%d\n", getpgrp()); +#else + sprintf(tbuf, "%d\n", getpgrp(getpid())); +#endif + else if (name[6] == 'i') + sprintf(tbuf, "%d\n", getpid()); + else + sprintf(tbuf, "%d\n", getppid()); + i = strlen(tbuf); + spec_setup(iop, i, 1); + strcpy(iop->buf, tbuf); + return 0; +} + +/* useropen --- "open" /dev/user */ + +/* + * /dev/user creates a record as follows: + * $1 = getuid() + * $2 = geteuid() + * $3 = getgid() + * $4 = getegid() + * If multiple groups are supported, the $5 through $NF are the + * supplementary group set. 
+ */
+
+int
+useropen(iop, name, mode)
+IOBUF *iop;
+char *name, *mode;
+{
+	char tbuf[BUFSIZ], *cp;
+	int i;
+#if defined(NGROUPS_MAX) && NGROUPS_MAX > 0
+	int groupset[NGROUPS_MAX];
+	int ngroups;
+#endif
+
+	sprintf(tbuf, "%d %d %d %d", getuid(), geteuid(), getgid(), getegid());
+
+	cp = tbuf + strlen(tbuf);
+#if defined(NGROUPS_MAX) && NGROUPS_MAX > 0
+	ngroups = getgroups(NGROUPS_MAX, groupset);
+	if (ngroups == -1)
+		fatal("could not find groups: %s", strerror(errno));
+
+	for (i = 0; i < ngroups; i++) {
+		*cp++ = ' ';
+		sprintf(cp, "%d", groupset[i]);
+		cp += strlen(cp);
+	}
+#endif
+	*cp++ = '\n';
+	*cp++ = '\0';
+
+
+	i = strlen(tbuf);
+	spec_setup(iop, i, 1);
+	strcpy(iop->buf, tbuf);
+	return 0;
+}
+
+/* iop_open --- handle special and regular files for input */
+
+static IOBUF *
+iop_open(name, mode)
+char *name, *mode;
+{
+	int openfd = INVALID_HANDLE;
+	char *cp, *ptr;
+	int flag = 0;
+	int i;
+	struct stat buf;
+	IOBUF *iop;
+	static struct internal {
+		char *name;
+		int compare;
+		int (*fp)();
+		IOBUF iob;
+	} table[] = {
+		{ "/dev/fd/", 8, specfdopen },
+		{ "/dev/stdin", 10, specfdopen },
+		{ "/dev/stdout", 11, specfdopen },
+		{ "/dev/stderr", 11, specfdopen },
+		{ "/dev/pid", 8, pidopen },
+		{ "/dev/ppid", 9, pidopen },
+		{ "/dev/pgrpid", 11, pidopen },
+		{ "/dev/user", 9, useropen },
+	};
+	int devcount = sizeof(table) / sizeof(table[0]);
+
+	flag = str2mode(mode);
+
+	if (do_unix)
+		goto strictopen;
+
+	if (STREQ(name, "-"))
+		openfd = fileno(stdin);
+	else if (STREQN(name, "/dev/", 5) && stat(name, &buf) == -1) {
+		int i;
+
+		for (i = 0; i < devcount; i++) {
+			if (STREQN(name, table[i].name, table[i].compare)) {
+				IOBUF *iop = & table[i].iob;
+
+				if (iop->buf != NULL) {
+					spec_setup(iop, 0, 0);
+					return iop;
+				} else if ((*table[i].fp)(iop, name, mode) == 0)
+					return iop;
+				else {
+					warning("could not open %s, mode `%s'",
+						name, mode);
+					return NULL;
+				}
+			}
+		}
+	}
+
+strictopen:
+	if (openfd == INVALID_HANDLE)
+		openfd = open(name, flag, 0666);
+	if (openfd != INVALID_HANDLE && fstat(openfd, &buf) > 0)
+		if ((buf.st_mode & S_IFMT) == S_IFDIR)
+			fatal("file `%s' is a directory", name);
+	iop = iop_alloc(openfd);
+	return iop;
+}
+
+#ifndef PIPES_SIMULATED
+	/* real pipes */
+static int
+wait_any(interesting)
+int interesting; /* pid of interest, if any */
+{
+	SIGTYPE (*hstat)(), (*istat)(), (*qstat)();
+	int pid;
+	int status = 0;
+	struct redirect *redp;
+	extern int errno;
+
+	hstat = signal(SIGHUP, SIG_IGN);
+	istat = signal(SIGINT, SIG_IGN);
+	qstat = signal(SIGQUIT, SIG_IGN);
+	for (;;) {
+#ifdef NeXT
+		pid = wait((union wait *)&status);
+#else
+		pid = wait(&status);
+#endif /* NeXT */
+		if (interesting && pid == interesting) {
+			break;
+		} else if (pid != -1) {
+			for (redp = red_head; redp != NULL; redp = redp->next)
+				if (pid == redp->pid) {
+					redp->pid = -1;
+					redp->status = status;
+					if (redp->fp) {
+						pclose(redp->fp);
+						redp->fp = 0;
+					}
+					if (redp->iop) {
+						(void) iop_close(redp->iop);
+						redp->iop = 0;
+					}
+					break;
+				}
+		}
+		if (pid == -1 && errno == ECHILD)
+			break;
+	}
+	signal(SIGHUP, hstat);
+	signal(SIGINT, istat);
+	signal(SIGQUIT, qstat);
+	return(status);
+}
+
+static IOBUF *
+gawk_popen(cmd, rp)
+char *cmd;
+struct redirect *rp;
+{
+	int p[2];
+	register int pid;
+
+	/* used to wait for any children to synchronize input and output,
+	 * but this could cause gawk to hang when it is started in a pipeline
+	 * and thus has a child process feeding it input (shell dependant)
+	 */
+	/*(void) wait_any(0);*/ /* wait for outstanding processes */
+
+	if (pipe(p) < 0)
+		fatal("cannot open pipe \"%s\" (%s)", cmd, strerror(errno));
+	if ((pid = fork()) == 0) {
+		if (close(1) == -1)
+			fatal("close of stdout in child failed (%s)",
+				strerror(errno));
+		if (dup(p[1]) != 1)
+			fatal("dup of pipe failed (%s)", strerror(errno));
+		if (close(p[0]) == -1 || close(p[1]) == -1)
+			fatal("close of pipe failed (%s)", strerror(errno));
+		if (close(0) == -1)
+			fatal("close of stdin in child failed (%s)",
+				strerror(errno));
+		execl("/bin/sh", "sh", "-c", cmd, 0);
+		_exit(127);
+	}
+	if (pid == -1)
+		fatal("cannot fork for \"%s\" (%s)", cmd, strerror(errno));
+	rp->pid = pid;
+	if (close(p[1]) == -1)
+		fatal("close of pipe failed (%s)", strerror(errno));
+	return (rp->iop = iop_alloc(p[0]));
+}
+
+static int
+gawk_pclose(rp)
+struct redirect *rp;
+{
+	(void) iop_close(rp->iop);
+	rp->iop = NULL;
+
+	/* process previously found, return stored status */
+	if (rp->pid == -1)
+		return (rp->status >> 8) & 0xFF;
+	rp->status = wait_any(rp->pid);
+	rp->pid = -1;
+	return (rp->status >> 8) & 0xFF;
+}
+
+#else /* PIPES_SIMULATED */
+	/* use temporary file rather than pipe */
+
+#ifdef VMS
+static IOBUF *
+gawk_popen(cmd, rp)
+char *cmd;
+struct redirect *rp;
+{
+	FILE *current;
+
+	if ((current = popen(cmd, "r")) == NULL)
+		return NULL;
+	return (rp->iop = iop_alloc(fileno(current)));
+}
+
+static int
+gawk_pclose(rp)
+struct redirect *rp;
+{
+	int rval, aval, fd = rp->iop->fd;
+	FILE *kludge = fdopen(fd, "r"); /* pclose needs FILE* w/ right fileno */
+
+	rp->iop->fd = dup(fd); /* kludge to allow close() + pclose() */
+	rval = iop_close(rp->iop);
+	rp->iop = NULL;
+	aval = pclose(kludge);
+	return (rval < 0 ? rval : aval);
+}
+#else /* VMS */
+
+static
+struct {
+	char *command;
+	char *name;
+} pipes[_NFILE];
+
+static IOBUF *
+gawk_popen(cmd, rp)
+char *cmd;
+struct redirect *rp;
+{
+	extern char *strdup(const char *);
+	int current;
+	char *name;
+	static char cmdbuf[256];
+
+	/* get a name to use. */
+	if ((name = tempnam(".", "pip")) == NULL)
+		return NULL;
+	sprintf(cmdbuf,"%s > %s", cmd, name);
+	system(cmdbuf);
+	if ((current = open(name,O_RDONLY)) == INVALID_HANDLE)
+		return NULL;
+	pipes[current].name = name;
+	pipes[current].command = strdup(cmd);
+	rp->iop = iop_alloc(current);
+	return (rp->iop = iop_alloc(current));
+}
+
+static int
+gawk_pclose(rp)
+struct redirect *rp;
+{
+	int cur = rp->iop->fd;
+	int rval;
+
+	rval = iop_close(rp->iop);
+	rp->iop = NULL;
+
+	/* check for an open file */
+	if (pipes[cur].name == NULL)
+		return -1;
+	unlink(pipes[cur].name);
+	free(pipes[cur].name);
+	pipes[cur].name = NULL;
+	free(pipes[cur].command);
+	return rval;
+}
+#endif /* VMS */
+
+#endif /* PIPES_SIMULATED */
+
+NODE *
+do_getline(tree)
+NODE *tree;
+{
+	struct redirect *rp = NULL;
+	IOBUF *iop;
+	int cnt = EOF;
+	char *s = NULL;
+	int errcode;
+
+	while (cnt == EOF) {
+		if (tree->rnode == NULL) { /* no redirection */
+			iop = nextfile(0);
+			if (iop == NULL) /* end of input */
+				return tmp_number((AWKNUM) 0.0);
+		} else {
+			int redir_error = 0;
+
+			rp = redirect(tree->rnode, &redir_error);
+			if (rp == NULL && redir_error) { /* failed redirect */
+				if (! do_unix) {
+					char *s = strerror(redir_error);
+
+					unref(ERRNO_node->var_value);
+					ERRNO_node->var_value =
+						make_string(s, strlen(s));
+				}
+				return tmp_number((AWKNUM) -1.0);
+			}
+			iop = rp->iop;
+			if (iop == NULL) /* end of input */
+				return tmp_number((AWKNUM) 0.0);
+		}
+		errcode = 0;
+		cnt = get_a_record(&s, iop, *RS, & errcode);
+		if (! do_unix && errcode != 0) {
+			char *s = strerror(errcode);
+
+			unref(ERRNO_node->var_value);
+			ERRNO_node->var_value = make_string(s, strlen(s));
+			return tmp_number((AWKNUM) -1.0);
+		}
+		if (cnt == EOF) {
+			if (rp) {
+				/*
+				 * Don't do iop_close() here if we are
+				 * reading from a pipe; otherwise
+				 * gawk_pclose will not be called.
+				 */
+				if (!(rp->flag & RED_PIPE)) {
+					(void) iop_close(iop);
+					rp->iop = NULL;
+				}
+				rp->flag |= RED_EOF; /* sticky EOF */
+				return tmp_number((AWKNUM) 0.0);
+			} else
+				continue; /* try another file */
+		}
+		if (!rp) {
+			NR += 1;
+			FNR += 1;
+		}
+		if (tree->lnode == NULL) /* no optional var. */
+			set_record(s, cnt, 1);
+		else { /* assignment to variable */
+			Func_ptr after_assign = NULL;
+			NODE **lhs;
+
+			lhs = get_lhs(tree->lnode, &after_assign);
+			unref(*lhs);
+			*lhs = make_string(s, strlen(s));
+			(*lhs)->flags |= MAYBE_NUM;
+			/* we may have to regenerate $0 here! */
+			if (after_assign)
+				(*after_assign)();
+		}
+	}
+	return tmp_number((AWKNUM) 1.0);
+}
+
+int
+pathopen (file)
+char *file;
+{
+	int fd = do_pathopen(file);
+
+#ifdef DEFAULT_FILETYPE
+	if (! do_unix && fd <= INVALID_HANDLE) {
+		char *file_awk;
+		int save = errno;
+#ifdef VMS
+		int vms_save = vaxc$errno;
+#endif
+
+		/* append ".awk" and try again */
+		emalloc(file_awk, char *, strlen(file) +
+			sizeof(DEFAULT_FILETYPE) + 1, "pathopen");
+		sprintf(file_awk, "%s%s", file, DEFAULT_FILETYPE);
+		fd = do_pathopen(file_awk);
+		free(file_awk);
+		if (fd <= INVALID_HANDLE) {
+			errno = save;
+#ifdef VMS
+			vaxc$errno = vms_save;
+#endif
+		}
+	}
+#endif /*DEFAULT_FILETYPE*/
+
+	return fd;
+}
+
+static int
+do_pathopen (file)
+char *file;
+{
+	static char *savepath = DEFPATH; /* defined in config.h */
+	static int first = 1;
+	char *awkpath, *cp;
+	char trypath[BUFSIZ];
+	int fd;
+
+	if (STREQ(file, "-"))
+		return (0);
+
+	if (do_unix)
+		return (devopen(file, "r"));
+
+	if (first) {
+		first = 0;
+		if ((awkpath = getenv ("AWKPATH")) != NULL && *awkpath)
+			savepath = awkpath; /* used for restarting */
+	}
+	awkpath = savepath;
+
+	/* some kind of path name, no search */
+#ifdef VMS /* (strchr not equal implies either or both not NULL) */
+	if (strchr(file, ':') != strchr(file, ']')
+	 || strchr(file, '>') != strchr(file, '/'))
+#else /*!VMS*/
+#ifdef MSDOS
+	if (strchr(file, '/') != strchr(file, '\\')
+	 || strchr(file, ':') != NULL)
+#else
+	if (strchr(file, '/') != NULL)
+#endif /*MSDOS*/
+#endif /*VMS*/
+		return (devopen(file, "r"));
+
+	do {
+		trypath[0] = '\0';
+		/* this should take into account limits on size of trypath */
+		for (cp = trypath; *awkpath && *awkpath != ENVSEP; )
+			*cp++ = *awkpath++;
+
+		if (cp != trypath) { /* nun-null element in path */
+			/* add directory punctuation only if needed */
+#ifdef VMS
+			if (strchr(":]>/", *(cp-1)) == NULL)
+#else
+#ifdef MSDOS
+			if (strchr(":\\/", *(cp-1)) == NULL)
+#else
+			if (*(cp-1) != '/')
+#endif
+#endif
+				*cp++ = '/';
+			/* append filename */
+			strcpy (cp, file);
+		} else
+			strcpy (trypath, file);
+		if ((fd = devopen(trypath, "r")) >= 0)
+			return (fd);
+
+		/* no luck, keep going */
+		if(*awkpath == ENVSEP && awkpath[1] != '\0')
+			awkpath++; /* skip colon */
+	} while (*awkpath);
+	/*
+	 * You might have one of the awk
+	 * paths defined, WITHOUT the current working directory in it.
+	 * Therefore try to open the file in the current directory.
+	 */
+	return (devopen(file, "r"));
+}
diff --git a/gnu/usr.bin/awk/iop.c b/gnu/usr.bin/awk/iop.c
new file mode 100644
index 000000000000..0d7af1213db6
--- /dev/null
+++ b/gnu/usr.bin/awk/iop.c
@@ -0,0 +1,318 @@
+/*
+ * iop.c - do i/o related things.
+ */
+
+/*
+ * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc.
+ *
+ * This file is part of GAWK, the GNU implementation of the
+ * AWK Progamming Language.
+ *
+ * GAWK is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * GAWK is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GAWK; see the file COPYING. If not, write to
+ * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include "awk.h"
+
+#ifndef atarist
+#define INVALID_HANDLE (-1)
+#else
+#include <stddef.h>
+#include <fcntl.h>
+#define INVALID_HANDLE (__SMALLEST_VALID_HANDLE - 1)
+#endif /* atarist */
+
+
+#ifdef TEST
+int bufsize = 8192;
+
+void
+fatal(s)
+char *s;
+{
+	printf("%s\n", s);
+	exit(1);
+}
+#endif
+
+int
+optimal_bufsize(fd)
+int fd;
+{
+	struct stat stb;
+
+#ifdef VMS
+	/*
+	 * These values correspond with the RMS multi-block count used by
+	 * vms_open() in vms/vms_misc.c.
+	 */
+	if (isatty(fd) > 0)
+		return BUFSIZ;
+	else if (fstat(fd, &stb) < 0)
+		return 8*512; /* conservative in case of DECnet access */
+	else
+		return 24*512;
+
+#else
+	/*
+	 * System V doesn't have the file system block size in the
+	 * stat structure. So we have to make some sort of reasonable
+	 * guess. We use stdio's BUFSIZ, since that is what it was
+	 * meant for in the first place.
+	 */
+#ifdef BLKSIZE_MISSING
+#define DEFBLKSIZE BUFSIZ
+#else
+#define DEFBLKSIZE (stb.st_blksize ? stb.st_blksize : BUFSIZ)
+#endif
+
+#ifdef TEST
+	return bufsize;
+#else
+#ifndef atarist
+	if (isatty(fd))
+#else
+	/*
+	 * On ST redirected stdin does not have a name attached
+	 * (this could be hard to do to) and fstat would fail
+	 */
+	if (0 == fd || isatty(fd))
+#endif /*atarist */
+		return BUFSIZ;
+#ifndef BLKSIZE_MISSING
+	/* VMS POSIX 1.0: st_blksize is never assigned a value, so zero it */
+	stb.st_blksize = 0;
+#endif
+	if (fstat(fd, &stb) == -1)
+		fatal("can't stat fd %d (%s)", fd, strerror(errno));
+	if (lseek(fd, (off_t)0, 0) == -1)
+		return DEFBLKSIZE;
+	return ((int) (stb.st_size < DEFBLKSIZE ? stb.st_size : DEFBLKSIZE));
+#endif /*! TEST */
+#endif /*! VMS */
+}
+
+IOBUF *
+iop_alloc(fd)
+int fd;
+{
+	IOBUF *iop;
+
+	if (fd == INVALID_HANDLE)
+		return NULL;
+	emalloc(iop, IOBUF *, sizeof(IOBUF), "iop_alloc");
+	iop->flag = 0;
+	if (isatty(fd))
+		iop->flag |= IOP_IS_TTY;
+	iop->size = optimal_bufsize(fd);
+	iop->secsiz = -2;
+	errno = 0;
+	iop->fd = fd;
+	iop->off = iop->buf = NULL;
+	iop->cnt = 0;
+	return iop;
+}
+
+/*
+ * Get the next record. Uses a "split buffer" where the latter part is
+ * the normal read buffer and the head part is an "overflow" area that is used
+ * when a record spans the end of the normal buffer, in which case the first
+ * part of the record is copied into the overflow area just before the
+ * normal buffer. Thus, the eventual full record can be returned as a
+ * contiguous area of memory with a minimum of copying. The overflow area
+ * is expanded as needed, so that records are unlimited in length.
+ * We also mark both the end of the buffer and the end of the read() with
+ * a sentinel character (the current record separator) so that the inside
+ * loop can run as a single test.
+ */
+int
+get_a_record(out, iop, grRS, errcode)
+char **out;
+IOBUF *iop;
+register int grRS;
+int *errcode;
+{
+	register char *bp = iop->off;
+	char *bufend;
+	char *start = iop->off; /* beginning of record */
+	int saw_newline;
+	char rs;
+	int eat_whitespace;
+
+	if (iop->cnt == EOF) /* previous read hit EOF */
+		return EOF;
+
+	if (grRS == 0) { /* special case: grRS == "" */
+		rs = '\n';
+		eat_whitespace = 0;
+		saw_newline = 0;
+	} else
+		rs = (char) grRS;
+
+	/* set up sentinel */
+	if (iop->buf) {
+		bufend = iop->buf + iop->size + iop->secsiz;
+		*bufend = rs;
+	} else
+		bufend = NULL;
+
+	for (;;) { /* break on end of record, read error or EOF */
+
+		/* Following code is entered on the first call of this routine
+		 * for a new iop, or when we scan to the end of the buffer.
+		 * In the latter case, we copy the current partial record to
+		 * the space preceding the normal read buffer. If necessary,
+		 * we expand this space. This is done so that we can return
+		 * the record as a contiguous area of memory.
+		 */
+		if ((iop->flag & IOP_IS_INTERNAL) == 0 && bp >= bufend) {
+			char *oldbuf = NULL;
+			char *oldsplit = iop->buf + iop->secsiz;
+			long len; /* record length so far */
+
+			if ((iop->flag & IOP_IS_INTERNAL) != 0)
+				cant_happen();
+
+			len = bp - start;
+			if (len > iop->secsiz) {
+				/* expand secondary buffer */
+				if (iop->secsiz == -2)
+					iop->secsiz = 256;
+				while (len > iop->secsiz)
+					iop->secsiz *= 2;
+				oldbuf = iop->buf;
+				emalloc(iop->buf, char *,
+				    iop->size+iop->secsiz+2, "get_a_record");
+				bufend = iop->buf + iop->size + iop->secsiz;
+				*bufend = rs;
+			}
+			if (len > 0) {
+				char *newsplit = iop->buf + iop->secsiz;
+
+				if (start < oldsplit) {
+					memcpy(newsplit - len, start,
+							oldsplit - start);
+					memcpy(newsplit - (bp - oldsplit),
+							oldsplit, bp - oldsplit);
+				} else
+					memcpy(newsplit - len, start, len);
+			}
+			bp = iop->end = iop->off = iop->buf + iop->secsiz;
+			start = bp - len;
+			if (oldbuf) {
+				free(oldbuf);
+				oldbuf = NULL;
+			}
+		}
+		/* Following code is entered whenever we have no more data to
+		 * scan. In most cases this will read into the beginning of
+		 * the main buffer, but in some cases (terminal, pipe etc.)
+		 * we may be doing smallish reads into more advanced positions.
+		 */
+		if (bp >= iop->end) {
+			if ((iop->flag & IOP_IS_INTERNAL) != 0) {
+				iop->cnt = EOF;
+				break;
+			}
+			iop->cnt = read(iop->fd, iop->end, bufend - iop->end);
+			if (iop->cnt == -1) {
+				if (! do_unix && errcode != NULL) {
+					*errcode = errno;
+					iop->cnt = EOF;
+					break;
+				} else
+					fatal("error reading input: %s",
+						strerror(errno));
+			} else if (iop->cnt == 0) {
+				iop->cnt = EOF;
+				break;
+			}
+			iop->end += iop->cnt;
+			*iop->end = rs;
+		}
+		if (grRS == 0) {
+			extern int default_FS;
+
+			if (default_FS && (bp == start || eat_whitespace)) {
+				while (bp < iop->end && isspace(*bp))
+					bp++;
+				if (bp == iop->end) {
+					eat_whitespace = 1;
+					continue;
+				} else
+					eat_whitespace = 0;
+			}
+			if (saw_newline && *bp == rs) {
+				bp++;
+				break;
+			}
+			saw_newline = 0;
+		}
+
+		while (*bp++ != rs)
+			;
+
+		if (bp <= iop->end) {
+			if (grRS == 0)
+				saw_newline = 1;
+			else
+				break;
+		} else
+			bp--;
+
+		if ((iop->flag & IOP_IS_INTERNAL) != 0)
+			iop->cnt = bp - start;
+	}
+	if (iop->cnt == EOF
+	    && (((iop->flag & IOP_IS_INTERNAL) != 0) || start == bp))
+		return EOF;
+
+	iop->off = bp;
+	bp--;
+	if (*bp != rs)
+		bp++;
+	*bp = '\0';
+	if (grRS == 0) {
+		if (*--bp == rs)
+			*bp = '\0';
+		else
+			bp++;
+	}
+
+	*out = start;
+	return bp - start;
+}
+
+#ifdef TEST
+main(argc, argv)
+int argc;
+char *argv[];
+{
+	IOBUF *iop;
+	char *out;
+	int cnt;
+	char rs[2];
+
+	rs[0] = 0;
+	if (argc > 1)
+		bufsize = atoi(argv[1]);
+	if (argc > 2)
+		rs[0] = *argv[2];
+	iop = iop_alloc(0);
+	while ((cnt = get_a_record(&out, iop, rs[0], NULL)) > 0) {
+		fwrite(out, 1, cnt, stdout);
+		fwrite(rs, 1, 1, stdout);
+	}
+}
+#endif
diff --git a/gnu/usr.bin/awk/main.c b/gnu/usr.bin/awk/main.c
new file mode 100644
index 000000000000..77d0bf74e143
--- /dev/null
+++ b/gnu/usr.bin/awk/main.c
@@ -0,0 +1,731 @@
+/*
+ * main.c -- Expression tree constructors and main program for gawk.
+ */
+
+/*
+ * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc.
+ *
+ * This file is part of GAWK, the GNU implementation of the
+ * AWK Progamming Language.
+ *
+ * GAWK is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * GAWK is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with GAWK; see the file COPYING. If not, write to
+ * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.
+ */
+
+#include "getopt.h"
+#include "awk.h"
+#include "patchlevel.h"
+
+static void usage P((int exitval));
+static void copyleft P((void));
+static void cmdline_fs P((char *str));
+static void init_args P((int argc0, int argc, char *argv0, char **argv));
+static void init_vars P((void));
+static void pre_assign P((char *v));
+SIGTYPE catchsig P((int sig, int code));
+static void gawk_option P((char *optstr));
+static void nostalgia P((void));
+static void version P((void));
+char *gawk_name P((char *filespec));
+
+#ifdef MSDOS
+extern int isatty P((int));
+#endif
+
+extern void resetup P((void));
+
+/* These nodes store all the special variables AWK uses */
+NODE *FS_node, *NF_node, *RS_node, *NR_node;
+NODE *FILENAME_node, *OFS_node, *ORS_node, *OFMT_node;
+NODE *CONVFMT_node;
+NODE *ERRNO_node;
+NODE *FNR_node, *RLENGTH_node, *RSTART_node, *SUBSEP_node;
+NODE *ENVIRON_node, *IGNORECASE_node;
+NODE *ARGC_node, *ARGV_node, *ARGIND_node;
+NODE *FIELDWIDTHS_node;
+
+int NF;
+int NR;
+int FNR;
+int IGNORECASE;
+char *RS;
+char *OFS;
+char *ORS;
+char *OFMT;
+char *CONVFMT;
+
+/*
+ * The parse tree and field nodes are stored here. Parse_end is a dummy item
+ * used to free up unneeded fields without freeing the program being run
+ */
+int errcount = 0; /* error counter, used by yyerror() */
+
+/* The global null string */
+NODE *Nnull_string;
+
+/* The name the program was invoked under, for error messages */
+const char *myname;
+
+/* A block of AWK code to be run before running the program */
+NODE *begin_block = 0;
+
+/* A block of AWK code to be run after the last input file */
+NODE *end_block = 0;
+
+int exiting = 0; /* Was an "exit" statement executed? */
+int exit_val = 0; /* optional exit value */
+
+#if defined(YYDEBUG) || defined(DEBUG)
+extern int yydebug;
+#endif
+
+struct src *srcfiles = NULL; /* source file name(s) */
+int numfiles = -1; /* how many source files */
+
+int do_unix = 0; /* turn off gnu extensions */
+int do_posix = 0; /* turn off gnu and unix extensions */
+int do_lint = 0; /* provide warnings about questionable stuff */
+int do_nostalgia = 0; /* provide a blast from the past */
+
+int in_begin_rule = 0; /* we're in a BEGIN rule */
+int in_end_rule = 0; /* we're in a END rule */
+
+int output_is_tty = 0; /* control flushing of output */
+
+extern char *version_string; /* current version, for printing */
+
+NODE *expression_value;
+
+static struct option optab[] = {
+	{ "compat", no_argument, & do_unix, 1 },
+	{ "lint", no_argument, & do_lint, 1 },
+	{ "posix", no_argument, & do_posix, 1 },
+	{ "nostalgia", no_argument, & do_nostalgia, 1 },
+	{ "copyleft", no_argument, NULL, 'C' },
+	{ "copyright", no_argument, NULL, 'C' },
+	{ "field-separator", required_argument, NULL, 'F' },
+	{ "file", required_argument, NULL, 'f' },
+	{ "assign", required_argument, NULL, 'v' },
+	{ "version", no_argument, NULL, 'V' },
+	{ "usage", no_argument, NULL, 'u' },
+	{ "help", no_argument, NULL, 'u' },
+	{ "source", required_argument, NULL, 's' },
+#ifdef DEBUG
+	{ "parsedebug", no_argument, NULL, 'D' },
+#endif
+	{ 0, 0, 0, 0 }
+};
+
+int
+main(argc, argv)
+int argc;
+char **argv;
+{
+	int c;
+	char *scan;
+	extern int optind;
+	extern int opterr;
+	extern char *optarg;
+	int i;
+
+	(void) signal(SIGFPE, (SIGTYPE (*) P((int))) catchsig);
+	(void) signal(SIGSEGV, (SIGTYPE (*) P((int))) catchsig);
+#ifdef SIGBUS
+	(void) signal(SIGBUS, (SIGTYPE (*) P((int))) catchsig);
+#endif
+
+	myname = gawk_name(argv[0]);
+	argv[0] = (char *)myname;
+#ifdef VMS
+	vms_arg_fixup(&argc, &argv); /* emulate redirection, expand wildcards */
+#endif
+
+	/* remove sccs gunk */
+	if (strncmp(version_string, "@(#)", 4) == 0)
+		version_string += 4;
+
+	if (argc < 2)
+		usage(1);
+
+	/* initialize the null string */
+	Nnull_string = make_string("", 0);
+	Nnull_string->numbr = 0.0;
+	Nnull_string->type = Node_val;
+	Nnull_string->flags = (PERM|STR|STRING|NUM|NUMBER);
+
+	/* Set up the special variables */
+
+	/*
+	 * Note that this must be done BEFORE arg parsing else -F
+	 * breaks horribly
+	 */
+	init_vars();
+
+	/* worst case */
+	emalloc(srcfiles, struct src *, argc * sizeof(struct src), "main");
+	memset(srcfiles, '\0', argc * sizeof(struct src));
+
+	/* Tell the regex routines how they should work. . . */
+	resetup();
+
+	/* we do error messages ourselves on invalid options */
+	opterr = 0;
+
+	/* the + on the front tells GNU getopt not to rearrange argv */
+	while ((c = getopt_long(argc, argv, "+F:f:v:W:", optab, NULL)) != EOF) {
+		if (do_posix)
+			opterr = 1;
+		switch (c) {
+		case 'F':
+			cmdline_fs(optarg);
+			break;
+
+		case 'f':
+			/*
+			 * a la MKS awk, allow multiple -f options.
+			 * this makes function libraries real easy.
+			 * most of the magic is in the scanner.
+			 */
+			/* The following is to allow for whitespace at the end
+			 * of a #! /bin/gawk line in an executable file
+			 */
+			scan = optarg;
+			while (isspace(*scan))
+				scan++;
+			++numfiles;
+			srcfiles[numfiles].stype = SOURCEFILE;
+			if (*scan == '\0')
+				srcfiles[numfiles].val = argv[optind++];
+			else
+				srcfiles[numfiles].val = optarg;
+			break;
+
+		case 'v':
+			pre_assign(optarg);
+			break;
+
+		case 'W': /* gawk specific options */
+			gawk_option(optarg);
+			break;
+
+		/* These can only come from long form options */
+		case 'V':
+			version();
+			break;
+
+		case 'C':
+			copyleft();
+			break;
+
+		case 'u':
+			usage(0);
+			break;
+
+		case 's':
+			if (strlen(optarg) == 0)
+				warning("empty argument to --source ignored");
+			else {
+				srcfiles[++numfiles].stype = CMDLINE;
+				srcfiles[numfiles].val = optarg;
+			}
+			break;
+
+#ifdef DEBUG
+		case 'D':
+			yydebug = 2;
+			break;
+#endif
+
+		case '?':
+		default:
+			/*
+			 * New behavior. If not posix, an unrecognized
+			 * option stops argument processing so that it can
+			 * go into ARGV for the awk program to see. This
+			 * makes use of ``#! /bin/gawk -f'' easier.
+			 */
+			if (! do_posix)
+				goto out;
+			/* else
+				let getopt print error message for us */
+			break;
+		}
+	}
+out:
+
+	if (do_nostalgia)
+		nostalgia();
+
+	/* POSIX compliance also implies no Unix extensions either */
+	if (do_posix)
+		do_unix = 1;
+
+#ifdef DEBUG
+	setbuf(stdout, (char *) NULL); /* make debugging easier */
+#endif
+	if (isatty(fileno(stdout)))
+		output_is_tty = 1;
+	/* No -f or --source options, use next arg */
+	if (numfiles == -1) {
+		if (optind > argc - 1) /* no args left */
+			usage(1);
+		srcfiles[++numfiles].stype = CMDLINE;
+		srcfiles[numfiles].val = argv[optind];
+		optind++;
+	}
+	init_args(optind, argc, (char *) myname, argv);
+	(void) tokexpand();
+
+	/* Read in the program */
+	if (yyparse() || errcount)
+		exit(1);
+
+	/* Set up the field variables */
+	init_fields();
+
+	if (begin_block) {
+		in_begin_rule = 1;
+		(void) interpret(begin_block);
+	}
+	in_begin_rule = 0;
+	if (!exiting && (expression_value || end_block))
+		do_input();
+	if (end_block) {
+		in_end_rule = 1;
+		(void) interpret(end_block);
+	}
+	in_end_rule = 0;
+	if (close_io() != 0 && exit_val == 0)
+		exit_val = 1;
+	exit(exit_val); /* more portable */
+	return exit_val; /* to suppress warnings */
+}
+
+/* usage --- print usage information and exit */
+
+static void
+usage(exitval)
+int exitval;
+{
+	char *opt1 = " -f progfile [--]";
+	char *opt2 = " [--] 'program'";
+	char *regops = " [POSIX or GNU style options]";
+
+	version();
+	fprintf(stderr, "usage: %s%s%s file ...\n %s%s%s file ...\n",
+		myname, regops, opt1, myname, regops, opt2);
+
+	/* GNU long options info. Gack. */
+	fputs("\nPOSIX options:\t\tGNU long options:\n", stderr);
+	fputs("\t-f progfile\t\t--file=progfile\n", stderr);
+	fputs("\t-F fs\t\t\t--field-separator=fs\n", stderr);
+	fputs("\t-v var=val\t\t--assign=var=val\n", stderr);
+	fputs("\t-W compat\t\t--compat\n", stderr);
+	fputs("\t-W copyleft\t\t--copyleft\n", stderr);
+	fputs("\t-W copyright\t\t--copyright\n", stderr);
+	fputs("\t-W help\t\t\t--help\n", stderr);
+	fputs("\t-W lint\t\t\t--lint\n", stderr);
+#if 0
+	fputs("\t-W nostalgia\t\t--nostalgia\n", stderr);
+#endif
+#ifdef DEBUG
+	fputs("\t-W parsedebug\t\t--parsedebug\n", stderr);
+#endif
+	fputs("\t-W posix\t\t--posix\n", stderr);
+	fputs("\t-W source=program-text\t--source=program-text\n", stderr);
+	fputs("\t-W usage\t\t--usage\n", stderr);
+	fputs("\t-W version\t\t--version\n", stderr);
+	exit(exitval);
+}
+
+static void
+copyleft ()
+{
+	static char blurb_part1[] =
+"Copyright (C) 1989, 1991, 1992, Free Software Foundation.\n\
+\n\
+This program is free software; you can redistribute it and/or modify\n\
+it under the terms of the GNU General Public License as published by\n\
+the Free Software Foundation; either version 2 of the License, or\n\
+(at your option) any later version.\n\
+\n";
+	static char blurb_part2[] =
+"This program is distributed in the hope that it will be useful,\n\
+but WITHOUT ANY WARRANTY; without even the implied warranty of\n\
+MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the\n\
+GNU General Public License for more details.\n\
+\n";
+	static char blurb_part3[] =
+"You should have received a copy of the GNU General Public License\n\
+along with this program; if not, write to the Free Software\n\
+Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA.\n";
+
+	version();
+	fputs(blurb_part1, stderr);
+	fputs(blurb_part2, stderr);
+	fputs(blurb_part3, stderr);
+	fflush(stderr);
+}
+
+static void
+cmdline_fs(str)
+char *str;
+{
+	register NODE **tmp;
+	int len = strlen(str);
+
+	tmp = get_lhs(FS_node, (Func_ptr *) 0);
+	unref(*tmp);
+	/*
+	 * Only if in full compatibility mode check for the stupid special
+	 * case so -F\t works as documented in awk even though the shell
+	 * hands us -Ft. Bleah!
+	 *
+	 * Thankfully, Posix didn't propogate this "feature".
+	 */
+	if (str[0] == 't' && str[1] == '\0') {
+		if (do_lint)
+			warning("-Ft does not set FS to tab in POSIX awk");
+		if (do_unix && ! do_posix)
+			str[0] = '\t';
+	}
+	*tmp = make_str_node(str, len, SCAN); /* do process escapes */
+	set_FS();
+}
+
+static void
+init_args(argc0, argc, argv0, argv)
+int argc0, argc;
+char *argv0;
+char **argv;
+{
+	int i, j;
+	NODE **aptr;
+
+	ARGV_node = install("ARGV", node(Nnull_string, Node_var, (NODE *)NULL));
+	aptr = assoc_lookup(ARGV_node, tmp_number(0.0));
+	*aptr = make_string(argv0, strlen(argv0));
+	(*aptr)->flags |= MAYBE_NUM;
+	for (i = argc0, j = 1; i < argc; i++) {
+		aptr = assoc_lookup(ARGV_node, tmp_number((AWKNUM) j));
+		*aptr = make_string(argv[i], strlen(argv[i]));
+		(*aptr)->flags |= MAYBE_NUM;
+		j++;
+	}
+	ARGC_node = install("ARGC",
+			node(make_number((AWKNUM) j), Node_var, (NODE *) NULL));
+}
+
+/*
+ * Set all the special variables to their initial values.
+ */
+struct varinit {
+	NODE **spec;
+	char *name;
+	NODETYPE type;
+	char *strval;
+	AWKNUM numval;
+	Func_ptr assign;
+};
+static struct varinit varinit[] = {
+{&NF_node, "NF", Node_NF, 0, -1, set_NF },
+{&FIELDWIDTHS_node, "FIELDWIDTHS", Node_FIELDWIDTHS, "", 0, 0 },
+{&NR_node, "NR", Node_NR, 0, 0, set_NR },
+{&FNR_node, "FNR", Node_FNR, 0, 0, set_FNR },
+{&FS_node, "FS", Node_FS, " ", 0, 0 },
+{&RS_node, "RS", Node_RS, "\n", 0, set_RS },
+{&IGNORECASE_node, "IGNORECASE", Node_IGNORECASE, 0, 0, set_IGNORECASE },
+{&FILENAME_node, "FILENAME", Node_var, "-", 0, 0 },
+{&OFS_node, "OFS", Node_OFS, " ", 0, set_OFS },
+{&ORS_node, "ORS", Node_ORS, "\n", 0, set_ORS },
+{&OFMT_node, "OFMT", Node_OFMT, "%.6g", 0, set_OFMT },
+{&CONVFMT_node, "CONVFMT", Node_CONVFMT, "%.6g", 0, set_CONVFMT },
+{&RLENGTH_node, "RLENGTH", Node_var, 0, 0, 0 },
+{&RSTART_node, "RSTART", Node_var, 0, 0, 0 },
+{&SUBSEP_node, "SUBSEP", Node_var, "\034", 0, 0 },
+{&ARGIND_node, "ARGIND", Node_var, 0, 0, 0 },
+{&ERRNO_node, "ERRNO", Node_var, 0, 0, 0 },
+{0, 0, Node_illegal, 0, 0, 0 },
+};
+
+static void
+init_vars()
+{
+	register struct varinit *vp;
+
+	for (vp = varinit; vp->name; vp++) {
+		*(vp->spec) = install(vp->name,
+		  node(vp->strval == 0 ? make_number(vp->numval)
+				: make_string(vp->strval, strlen(vp->strval)),
+		       vp->type, (NODE *) NULL));
+		if (vp->assign)
+			(*(vp->assign))();
+	}
+}
+
+void
+load_environ()
+{
+#if !defined(MSDOS) && !(defined(VMS) && defined(__DECC))
+	extern char **environ;
+#endif
+	register char *var, *val;
+	NODE **aptr;
+	register int i;
+
+	ENVIRON_node = install("ENVIRON",
+			node(Nnull_string, Node_var, (NODE *) NULL));
+	for (i = 0; environ[i]; i++) {
+		static char nullstr[] = "";
+
+		var = environ[i];
+		val = strchr(var, '=');
+		if (val)
+			*val++ = '\0';
+		else
+			val = nullstr;
+		aptr = assoc_lookup(ENVIRON_node, tmp_string(var, strlen (var)));
+		*aptr = make_string(val, strlen (val));
+		(*aptr)->flags |= MAYBE_NUM;
+
+		/* restore '=' so that system() gets a valid environment */
+		if (val != nullstr)
+			*--val = '=';
+	}
+}
+
+/* Process a command-line assignment */
+char *
+arg_assign(arg)
+char *arg;
+{
+	char *cp;
+	Func_ptr after_assign = NULL;
+	NODE *var;
+	NODE *it;
+	NODE **lhs;
+
+	cp = strchr(arg, '=');
+	if (cp != NULL) {
+		*cp++ = '\0';
+		/*
+		 * Recent versions of nawk expand escapes inside assignments.
+		 * This makes sense, so we do it too.
+		 */
+		it = make_str_node(cp, strlen(cp), SCAN);
+		it->flags |= MAYBE_NUM;
+		var = variable(arg, 0);
+		lhs = get_lhs(var, &after_assign);
+		unref(*lhs);
+		*lhs = it;
+		if (after_assign)
+			(*after_assign)();
+		*--cp = '='; /* restore original text of ARGV */
+	}
+	return cp;
+}
+
+static void
+pre_assign(v)
+char *v;
+{
+	if (!arg_assign(v)) {
+		fprintf (stderr,
+			"%s: '%s' argument to -v not in 'var=value' form\n",
+				myname, v);
+		usage(1);
+	}
+}
+
+SIGTYPE
+catchsig(sig, code)
+int sig, code;
+{
+#ifdef lint
+	code = 0; sig = code; code = sig;
+#endif
+	if (sig == SIGFPE) {
+		fatal("floating point exception");
+	} else if (sig == SIGSEGV
+#ifdef SIGBUS
+		|| sig == SIGBUS
+#endif
+	) {
+		msg("fatal error: internal error");
+		/* fatal won't abort() if not compiled for debugging */
+		abort();
+	} else
+		cant_happen();
+	/* NOTREACHED */
+}
+
+/* gawk_option --- do gawk specific things */
+
+static void
+gawk_option(optstr)
+char *optstr;
+{
+	char *cp;
+
+	for (cp = optstr; *cp; cp++) {
+		switch (*cp) {
+		case ' ':
+		case '\t':
+		case ',':
+			break;
+		case 'v':
+		case 'V':
+			/* print version */
+			if (strncasecmp(cp, "version", 7) != 0)
+				goto unknown;
+			else
+				cp += 6;
+			version();
+			break;
+		case 'c':
+		case 'C':
+			if (strncasecmp(cp, "copyright", 9) == 0) {
+				cp += 8;
+				copyleft();
+			} else if (strncasecmp(cp, "copyleft", 8) == 0) {
+				cp += 7;
+				copyleft();
+			} else if (strncasecmp(cp, "compat", 6) == 0) {
+				cp += 5;
+				do_unix = 1;
+			} else
+				goto unknown;
+			break;
+		case 'n':
+		case 'N':
+			/*
+			 * Undocumented feature,
+			 * inspired by nostalgia, and a T-shirt
+			 */
+			if (strncasecmp(cp, "nostalgia", 9) != 0)
+				goto unknown;
+			nostalgia();
+			break;
+		case 'p':
+		case 'P':
+#ifdef DEBUG
+			if (strncasecmp(cp, "parsedebug", 10) == 0) {
+				cp += 9;
+				yydebug = 2;
+				break;
+			}
+#endif
+			if (strncasecmp(cp, "posix", 5) != 0)
+				goto unknown;
+			cp += 4;
+			do_posix = do_unix = 1;
+			break;
+		case 'l':
+		case 'L':
+			if (strncasecmp(cp, "lint", 4) != 0)
+				goto unknown;
+			cp += 3;
+ do_lint = 1; + break; + case 'H': + case 'h': + if (strncasecmp(cp, "help", 4) != 0) + goto unknown; + cp += 3; + usage(0); + break; + case 'U': + case 'u': + if (strncasecmp(cp, "usage", 5) != 0) + goto unknown; + cp += 4; + usage(0); + break; + case 's': + case 'S': + if (strncasecmp(cp, "source=", 7) != 0) + goto unknown; + cp += 7; + if (strlen(cp) == 0) + warning("empty argument to -Wsource ignored"); + else { + srcfiles[++numfiles].stype = CMDLINE; + srcfiles[numfiles].val = cp; + return; + } + break; + default: + unknown: + fprintf(stderr, "'%c' -- unknown option, ignored\n", + *cp); + break; + } + } +} + +/* nostalgia --- print the famous error message and die */ + +static void +nostalgia() +{ + fprintf(stderr, "awk: bailing out near line 1\n"); + abort(); +} + +/* version --- print version message */ + +static void +version() +{ + fprintf(stderr, "%s, patchlevel %d\n", version_string, PATCHLEVEL); +} + +/* static */ +char * +gawk_name(filespec) +char *filespec; +{ + char *p; + +#ifdef VMS /* "device:[root.][directory.subdir]GAWK.EXE;n" -> "GAWK" */ + char *q; + + p = strrchr(filespec, ']'); /* directory punctuation */ + q = strrchr(filespec, '>'); /* alternate <international> punct */ + + if (p == NULL || q > p) p = q; + p = strdup(p == NULL ? filespec : (p + 1)); + if ((q = strrchr(p, '.')) != NULL) *q = '\0'; /* strip .typ;vers */ + + return p; +#endif /*VMS*/ + +#if defined(MSDOS) || defined(atarist) + char *q; + + p = filespec; + + if (q = strrchr(p, '\\')) + p = q + 1; + if (q = strchr(p, '.')) + *q = '\0'; + strlwr(p); + + return (p == NULL ? filespec : p); +#endif /* MSDOS || atarist */ + + /* "path/name" -> "name" */ + p = strrchr(filespec, '/'); + return (p == NULL ? 
filespec : p + 1); +} diff --git a/gnu/usr.bin/awk/msg.c b/gnu/usr.bin/awk/msg.c new file mode 100644 index 000000000000..b60fe9d1e5e9 --- /dev/null +++ b/gnu/usr.bin/awk/msg.c @@ -0,0 +1,106 @@ +/* + * msg.c - routines for error messages + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2, or (at your option) + * any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +int sourceline = 0; +char *source = NULL; + +/* VARARGS2 */ +void +err(s, emsg, argp) +char *s; +char *emsg; +va_list argp; +{ + char *file; + + (void) fflush(stdout); + (void) fprintf(stderr, "%s: ", myname); + if (sourceline) { + if (source) + (void) fprintf(stderr, "%s:", source); + else + (void) fprintf(stderr, "cmd. 
line:"); + + (void) fprintf(stderr, "%d: ", sourceline); + } + if (FNR) { + file = FILENAME_node->var_value->stptr; + if (file) + (void) fprintf(stderr, "(FILENAME=%s ", file); + (void) fprintf(stderr, "FNR=%d) ", FNR); + } + (void) fprintf(stderr, s); + vfprintf(stderr, emsg, argp); + (void) fprintf(stderr, "\n"); + (void) fflush(stderr); +} + +/*VARARGS0*/ +void +msg(va_alist) +va_dcl +{ + va_list args; + char *mesg; + + va_start(args); + mesg = va_arg(args, char *); + err("", mesg, args); + va_end(args); +} + +/*VARARGS0*/ +void +warning(va_alist) +va_dcl +{ + va_list args; + char *mesg; + + va_start(args); + mesg = va_arg(args, char *); + err("warning: ", mesg, args); + va_end(args); +} + +/*VARARGS0*/ +void +fatal(va_alist) +va_dcl +{ + va_list args; + char *mesg; + + va_start(args); + mesg = va_arg(args, char *); + err("fatal: ", mesg, args); + va_end(args); +#ifdef DEBUG + abort(); +#endif + exit(2); +} diff --git a/gnu/usr.bin/awk/node.c b/gnu/usr.bin/awk/node.c new file mode 100644 index 000000000000..65ecb0ed1723 --- /dev/null +++ b/gnu/usr.bin/awk/node.c @@ -0,0 +1,429 @@ +/* + * node.c -- routines for node management + */ + +/* + * Copyright (C) 1986, 1988, 1989, 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. 
If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +extern double strtod(); + +AWKNUM +r_force_number(n) +register NODE *n; +{ + register char *cp; + register char *cpend; + char save; + char *ptr; + unsigned int newflags = 0; + +#ifdef DEBUG + if (n == NULL) + cant_happen(); + if (n->type != Node_val) + cant_happen(); + if(n->flags == 0) + cant_happen(); + if (n->flags & NUM) + return n->numbr; +#endif + + /* all the conditionals are an attempt to avoid the expensive strtod */ + + n->numbr = 0.0; + n->flags |= NUM; + + if (n->stlen == 0) + return 0.0; + + cp = n->stptr; + if (isalpha(*cp)) + return 0.0; + + cpend = cp + n->stlen; + while (cp < cpend && isspace(*cp)) + cp++; + if (cp == cpend || isalpha(*cp)) + return 0.0; + + if (n->flags & MAYBE_NUM) { + newflags = NUMBER; + n->flags &= ~MAYBE_NUM; + } + if (cpend - cp == 1) { + if (isdigit(*cp)) { + n->numbr = (AWKNUM)(*cp - '0'); + n->flags |= newflags; + } + return n->numbr; + } + + errno = 0; + save = *cpend; + *cpend = '\0'; + n->numbr = (AWKNUM) strtod((const char *)cp, &ptr); + + /* POSIX says trailing space is OK for NUMBER */ + while (isspace(*ptr)) + ptr++; + *cpend = save; + /* the >= should be ==, but for SunOS 3.5 strtod() */ + if (errno == 0 && ptr >= cpend) + n->flags |= newflags; + else + errno = 0; + + return n->numbr; +} + +/* + * the following lookup table is used as an optimization in force_string + * (more complicated) variations on this theme didn't seem to pay off, but + * systematic testing might be in order at some point + */ +static char *values[] = { + "0", + "1", + "2", + "3", + "4", + "5", + "6", + "7", + "8", + "9", +}; +#define NVAL (sizeof(values)/sizeof(values[0])) + +NODE * +r_force_string(s) +register NODE *s; +{ + char buf[128]; + register char *sp = buf; + register long num = 0; + +#ifdef DEBUG + if (s == NULL) cant_happen(); + if (s->type != Node_val) cant_happen(); + if (s->flags & STR) return s; + if 
(!(s->flags & NUM)) cant_happen(); + if (s->stref != 0) ; /*cant_happen();*/ +#endif + + /* avoids floating point exception in DOS*/ + if ( s->numbr <= LONG_MAX && s->numbr >= -LONG_MAX) + num = (long)s->numbr; + if ((AWKNUM) num == s->numbr) { /* integral value */ + if (num < NVAL && num >= 0) { + sp = values[num]; + s->stlen = 1; + } else { + (void) sprintf(sp, "%ld", num); + s->stlen = strlen(sp); + } + s->stfmt = -1; + } else { + (void) sprintf(sp, CONVFMT, s->numbr); + s->stlen = strlen(sp); + s->stfmt = (char)CONVFMTidx; + } + s->stref = 1; + emalloc(s->stptr, char *, s->stlen + 2, "force_string"); + memcpy(s->stptr, sp, s->stlen+1); + s->flags |= STR; + return s; +} + +/* + * Duplicate a node. (For strings, "duplicate" means crank up the + * reference count.) + */ +NODE * +dupnode(n) +NODE *n; +{ + register NODE *r; + + if (n->flags & TEMP) { + n->flags &= ~TEMP; + n->flags |= MALLOC; + return n; + } + if ((n->flags & (MALLOC|STR)) == (MALLOC|STR)) { + if (n->stref < 255) + n->stref++; + return n; + } + getnode(r); + *r = *n; + r->flags &= ~(PERM|TEMP); + r->flags |= MALLOC; + if (n->type == Node_val && (n->flags & STR)) { + r->stref = 1; + emalloc(r->stptr, char *, r->stlen + 2, "dupnode"); + memcpy(r->stptr, n->stptr, r->stlen+1); + } + return r; +} + +/* this allocates a node with defined numbr */ +NODE * +mk_number(x, flags) +AWKNUM x; +unsigned int flags; +{ + register NODE *r; + + getnode(r); + r->type = Node_val; + r->numbr = x; + r->flags = flags; +#ifdef DEBUG + r->stref = 1; + r->stptr = 0; + r->stlen = 0; +#endif + return r; +} + +/* + * Make a string node. 
+ */ +NODE * +make_str_node(s, len, flags) +char *s; +size_t len; +int flags; +{ + register NODE *r; + + getnode(r); + r->type = Node_val; + r->flags = (STRING|STR|MALLOC); + if (flags & ALREADY_MALLOCED) + r->stptr = s; + else { + emalloc(r->stptr, char *, len + 2, s); + memcpy(r->stptr, s, len); + } + r->stptr[len] = '\0'; + + if (flags & SCAN) { /* scan for escape sequences */ + char *pf; + register char *ptm; + register int c; + register char *end; + + end = &(r->stptr[len]); + for (pf = ptm = r->stptr; pf < end;) { + c = *pf++; + if (c == '\\') { + c = parse_escape(&pf); + if (c < 0) { + if (do_lint) + warning("backslash at end of string"); + c = '\\'; + } + *ptm++ = c; + } else + *ptm++ = c; + } + len = ptm - r->stptr; + erealloc(r->stptr, char *, len + 1, "make_str_node"); + r->stptr[len] = '\0'; + r->flags |= PERM; + } + r->stlen = len; + r->stref = 1; + r->stfmt = -1; + + return r; +} + +NODE * +tmp_string(s, len) +char *s; +size_t len; +{ + register NODE *r; + + r = make_string(s, len); + r->flags |= TEMP; + return r; +} + + +#define NODECHUNK 100 + +NODE *nextfree = NULL; + +NODE * +more_nodes() +{ + register NODE *np; + + /* get more nodes and initialize list */ + emalloc(nextfree, NODE *, NODECHUNK * sizeof(NODE), "newnode"); + for (np = nextfree; np < &nextfree[NODECHUNK - 1]; np++) + np->nextp = np + 1; + np->nextp = NULL; + np = nextfree; + nextfree = nextfree->nextp; + return np; +} + +#ifdef DEBUG +void +freenode(it) +NODE *it; +{ +#ifdef MPROF + it->stref = 0; + free((char *) it); +#else /* not MPROF */ + /* add it to head of freelist */ + it->nextp = nextfree; + nextfree = it; +#endif /* not MPROF */ +} +#endif /* DEBUG */ + +void +unref(tmp) +register NODE *tmp; +{ + if (tmp == NULL) + return; + if (tmp->flags & PERM) + return; + if (tmp->flags & (MALLOC|TEMP)) { + tmp->flags &= ~TEMP; + if (tmp->flags & STR) { + if (tmp->stref > 1) { + if (tmp->stref != 255) + tmp->stref--; + return; + } + free(tmp->stptr); + } + freenode(tmp); + } +} + +/* + 
* Parse a C escape sequence. STRING_PTR points to a variable containing a + * pointer to the string to parse. That pointer is updated past the + * characters we use. The value of the escape sequence is returned. + * + * A negative value means the sequence \ newline was seen, which is supposed to + * be equivalent to nothing at all. + * + * If \ is followed by a null character, we return a negative value and leave + * the string pointer pointing at the null character. + * + * If \ is followed by 000, we return 0 and leave the string pointer after the + * zeros. A value of 0 does not mean end of string. + * + * Posix doesn't allow \x. + */ + +int +parse_escape(string_ptr) +char **string_ptr; +{ + register int c = *(*string_ptr)++; + register int i; + register int count; + + switch (c) { + case 'a': + return BELL; + case 'b': + return '\b'; + case 'f': + return '\f'; + case 'n': + return '\n'; + case 'r': + return '\r'; + case 't': + return '\t'; + case 'v': + return '\v'; + case '\n': + return -2; + case 0: + (*string_ptr)--; + return -1; + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + i = c - '0'; + count = 0; + while (++count < 3) { + if ((c = *(*string_ptr)++) >= '0' && c <= '7') { + i *= 8; + i += c - '0'; + } else { + (*string_ptr)--; + break; + } + } + return i; + case 'x': + if (do_lint) { + static int didwarn; + + if (! 
didwarn) { + didwarn = 1; + warning("Posix does not allow \"\\x\" escapes"); + } + } + if (do_posix) + return ('x'); + i = 0; + while (1) { + if (isxdigit((c = *(*string_ptr)++))) { + i *= 16; + if (isdigit(c)) + i += c - '0'; + else if (isupper(c)) + i += c - 'A' + 10; + else + i += c - 'a' + 10; + } else { + (*string_ptr)--; + break; + } + } + return i; + default: + return c; + } +} diff --git a/gnu/usr.bin/awk/patchlevel.h b/gnu/usr.bin/awk/patchlevel.h new file mode 100644 index 000000000000..c6161a1f274c --- /dev/null +++ b/gnu/usr.bin/awk/patchlevel.h @@ -0,0 +1 @@ +#define PATCHLEVEL 2 diff --git a/gnu/usr.bin/awk/protos.h b/gnu/usr.bin/awk/protos.h new file mode 100644 index 000000000000..25af32165b02 --- /dev/null +++ b/gnu/usr.bin/awk/protos.h @@ -0,0 +1,115 @@ +/* + * protos.h -- function prototypes for when the headers don't have them. + */ + +/* + * Copyright (C) 1991, 1992, the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. 
+ */ + +#ifdef __STDC__ +#define aptr_t void * /* arbitrary pointer type */ +#else +#define aptr_t char * +#endif +extern aptr_t malloc P((MALLOC_ARG_T)); +extern aptr_t realloc P((aptr_t, MALLOC_ARG_T)); +extern aptr_t calloc P((MALLOC_ARG_T, MALLOC_ARG_T)); + +extern void free P((aptr_t)); +extern char *getenv P((char *)); + +extern char *strcpy P((char *, const char *)); +extern char *strcat P((char *, const char *)); +extern char *strncpy P((char *, const char *, int)); +extern int strcmp P((const char *, const char *)); +extern int strncmp P((const char *, const char *, int)); +#ifndef VMS +extern char *strerror P((int)); +#else +extern char *strerror P((int,...)); +#endif +extern char *strchr P((const char *, int)); +extern char *strrchr P((const char *, int)); +extern char *strstr P((const char *s1, const char *s2)); +extern int strlen P((const char *)); +extern long strtol P((const char *, char **, int)); +#if !defined(_MSC_VER) && !defined(__GNU_LIBRARY__) +extern int strftime P((char *, int, const char *, const struct tm *)); +#endif +extern time_t time P((time_t *)); +extern aptr_t memset P((aptr_t, int, size_t)); +extern aptr_t memcpy P((aptr_t, const aptr_t, size_t)); +extern aptr_t memmove P((aptr_t, const aptr_t, size_t)); +extern aptr_t memchr P((const aptr_t, int, size_t)); +extern int memcmp P((const aptr_t, const aptr_t, size_t)); + +/* extern int fprintf P((FILE *, char *, ...)); */ +extern int fprintf P(()); +#if !defined(MSDOS) && !defined(__GNU_LIBRARY__) +extern int fwrite P((const char *, int, int, FILE *)); +extern int fputs P((const char *, FILE *)); +extern int unlink P((const char *)); +#endif +extern int fflush P((FILE *)); +extern int fclose P((FILE *)); +extern FILE *popen P((const char *, const char *)); +extern int pclose P((FILE *)); +extern void abort P(()); +extern int isatty P((int)); +extern void exit P((int)); +extern int system P((const char *)); +extern int sscanf P((/* char *, char *, ... 
*/)); +#ifndef toupper +extern int toupper P((int)); +#endif +#ifndef tolower +extern int tolower P((int)); +#endif + +extern double pow P((double x, double y)); +extern double atof P((char *)); +extern double strtod P((const char *, char **)); +extern int fstat P((int, struct stat *)); +extern int stat P((const char *, struct stat *)); +extern off_t lseek P((int, off_t, int)); +extern int fseek P((FILE *, long, int)); +extern int close P((int)); +extern int creat P(()); +extern int open P(()); +extern int pipe P((int *)); +extern int dup P((int)); +extern int dup2 P((int,int)); +extern int fork P(()); +extern int execl P((/* char *, char *, ... */)); +extern int read P((int, char *, int)); +extern int wait P((int *)); +extern void _exit P((int)); + +#ifndef __STDC__ +extern long time P((long *)); +#endif + +#ifdef NON_STD_SPRINTF +extern char *sprintf(); +#else +extern int sprintf(); +#endif /* SPRINTF_INT */ + +#undef aptr_t diff --git a/gnu/usr.bin/awk/re.c b/gnu/usr.bin/awk/re.c new file mode 100644 index 000000000000..495b0963cadb --- /dev/null +++ b/gnu/usr.bin/awk/re.c @@ -0,0 +1,208 @@ +/* + * re.c - compile regular expressions. + */ + +/* + * Copyright (C) 1991, 1992 the Free Software Foundation, Inc. + * + * This file is part of GAWK, the GNU implementation of the + * AWK Progamming Language. + * + * GAWK is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * GAWK is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with GAWK; see the file COPYING. 
If not, write to + * the Free Software Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. + */ + +#include "awk.h" + +/* Generate compiled regular expressions */ + +Regexp * +make_regexp(s, len, ignorecase, dfa) +char *s; +int len; +int ignorecase; +int dfa; +{ + Regexp *rp; + char *err; + char *src = s; + char *temp; + char *end = s + len; + register char *dest; + register int c; + + /* Handle escaped characters first. */ + + /* Build a copy of the string (in dest) with the + escaped characters translated, and generate the regex + from that. + */ + emalloc(dest, char *, len + 2, "make_regexp"); + temp = dest; + + while (src < end) { + if (*src == '\\') { + c = *++src; + switch (c) { + case 'a': + case 'b': + case 'f': + case 'n': + case 'r': + case 't': + case 'v': + case 'x': + case '0': + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + c = parse_escape(&src); + if (c < 0) + cant_happen(); + *dest++ = (char)c; + break; + default: + *dest++ = '\\'; + *dest++ = (char)c; + src++; + break; + } /* switch */ + } else { + *dest++ = *src++; /* not '\\' */ + } + } /* for */ + + *dest = '\0' ; /* Only necessary if we print dest ? 
*/ + emalloc(rp, Regexp *, sizeof(*rp), "make_regexp"); + memset((char *) rp, 0, sizeof(*rp)); + emalloc(rp->pat.buffer, char *, 16, "make_regexp"); + rp->pat.allocated = 16; + emalloc(rp->pat.fastmap, char *, 256, "make_regexp"); + + if (ignorecase) + rp->pat.translate = casetable; + else + rp->pat.translate = NULL; + len = dest - temp; + if ((err = re_compile_pattern(temp, (size_t) len, &(rp->pat))) != NULL) + fatal("%s: /%s/", err, temp); + if (dfa && !ignorecase) { + regcompile(temp, len, &(rp->dfareg), 1); + rp->dfa = 1; + } else + rp->dfa = 0; + free(temp); + return rp; +} + +int +research(rp, str, start, len, need_start) +Regexp *rp; +register char *str; +int start; +register int len; +int need_start; +{ + char *ret = str; + + if (rp->dfa) { + char save1; + char save2; + int count = 0; + int try_backref; + + save1 = str[start+len]; + str[start+len] = '\n'; + save2 = str[start+len+1]; + ret = regexecute(&(rp->dfareg), str+start, str+start+len+1, 1, + &count, &try_backref); + str[start+len] = save1; + str[start+len+1] = save2; + } + if (ret) { + if (need_start || rp->dfa == 0) + return re_search(&(rp->pat), str, start+len, start, + len, &(rp->regs)); + else + return 1; + } else + return -1; +} + +void +refree(rp) +Regexp *rp; +{ + free(rp->pat.buffer); + free(rp->pat.fastmap); + if (rp->dfa) + reg_free(&(rp->dfareg)); + free(rp); +} + +void +reg_error(s) +const char *s; +{ + fatal(s); +} + +Regexp * +re_update(t) +NODE *t; +{ + NODE *t1; + +# define CASE 1 + if ((t->re_flags & CASE) == IGNORECASE) { + if (t->re_flags & CONST) + return t->re_reg; + t1 = force_string(tree_eval(t->re_exp)); + if (t->re_text) { + if (cmp_nodes(t->re_text, t1) == 0) { + free_temp(t1); + return t->re_reg; + } + unref(t->re_text); + } + t->re_text = dupnode(t1); + free_temp(t1); + } + if (t->re_reg) + refree(t->re_reg); + if (t->re_cnt) + t->re_cnt++; + if (t->re_cnt > 10) + t->re_cnt = 0; + if (!t->re_text) { + t1 = force_string(tree_eval(t->re_exp)); + t->re_text = dupnode(t1); + 
free_temp(t1); + } + t->re_reg = make_regexp(t->re_text->stptr, t->re_text->stlen, IGNORECASE, t->re_cnt); + t->re_flags &= ~CASE; + t->re_flags |= IGNORECASE; + return t->re_reg; +} + +void +resetup() +{ + (void) re_set_syntax(RE_SYNTAX_AWK); + regsyntax(RE_SYNTAX_AWK, 0); +} diff --git a/gnu/usr.bin/awk/regex.c b/gnu/usr.bin/awk/regex.c new file mode 100644 index 000000000000..f4dd4c2cd24d --- /dev/null +++ b/gnu/usr.bin/awk/regex.c @@ -0,0 +1,2854 @@ +/* Extended regular expression matching and search library. + Copyright (C) 1985, 1989-90 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 1, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + + +/* To test, compile with -Dtest. This Dtestable feature turns this into + a self-contained program which reads a pattern, describes how it + compiles, then reads a string and searches for it. + + On the other hand, if you compile with both -Dtest and -Dcanned you + can run some tests we've already thought of. */ + + +#ifdef emacs + +/* The `emacs' switch turns on certain special matching commands + that make sense only in emacs. */ + +#include "lisp.h" +#include "buffer.h" +#include "syntax.h" + +/* We write fatal error messages on standard error. */ +#include <stdio.h> + +/* isalpha(3) etc. are used for the character classes. 
*/ +#include <ctype.h> + +#else /* not emacs */ + +#include "awk.h" + +#define NO_ALLOCA /* try it out for now */ +#ifndef NO_ALLOCA +/* Make alloca work the best possible way. */ +#ifdef __GNUC__ +#ifndef atarist +#ifndef alloca +#define alloca __builtin_alloca +#endif +#endif /* atarist */ +#else +#if defined(sparc) && !defined(__GNUC__) +#include <alloca.h> +#else +char *alloca (); +#endif +#endif /* __GNUC__ */ + +#define FREE_AND_RETURN_VOID(stackb) return +#define FREE_AND_RETURN(stackb,val) return(val) +#define DOUBLE_STACK(stackx,stackb,len) \ + (stackx = (unsigned char **) alloca (2 * len \ + * sizeof (unsigned char *)),\ + /* Only copy what is in use. */ \ + (unsigned char **) memcpy (stackx, stackb, len * sizeof (char *))) +#else /* NO_ALLOCA defined */ +#define FREE_AND_RETURN_VOID(stackb) free(stackb);return +#define FREE_AND_RETURN(stackb,val) free(stackb);return(val) +#define DOUBLE_STACK(stackx,stackb,len) \ + (unsigned char **) realloc (stackb, 2 * len * sizeof (unsigned char *)) +#endif /* NO_ALLOCA */ + +static void store_jump P((char *, int, char *)); +static void insert_jump P((int, char *, char *, char *)); +static void store_jump_n P((char *, int, char *, unsigned)); +static void insert_jump_n P((int, char *, char *, char *, unsigned)); +static void insert_op_2 P((int, char *, char *, int, int )); +static int memcmp_translate P((unsigned char *, unsigned char *, + int, unsigned char *)); +long re_set_syntax P((long)); + +/* Define the syntax stuff, so we can do the \<, \>, etc. */ + +/* This must be nonzero for the wordchar and notwordchar pattern + commands in re_match_2. 
*/ +#ifndef Sword +#define Sword 1 +#endif + +#define SYNTAX(c) re_syntax_table[c] + + +#ifdef SYNTAX_TABLE + +char *re_syntax_table; + +#else /* not SYNTAX_TABLE */ + +static char re_syntax_table[256]; +static void init_syntax_once P((void)); + + +static void +init_syntax_once () +{ + register int c; + static int done = 0; + + if (done) + return; + + memset (re_syntax_table, 0, sizeof re_syntax_table); + + for (c = 'a'; c <= 'z'; c++) + re_syntax_table[c] = Sword; + + for (c = 'A'; c <= 'Z'; c++) + re_syntax_table[c] = Sword; + + for (c = '0'; c <= '9'; c++) + re_syntax_table[c] = Sword; + + /* Add specific syntax for ISO Latin-1. */ + for (c = 0300; c <= 0377; c++) + re_syntax_table[c] = Sword; + re_syntax_table[0327] = 0; + re_syntax_table[0367] = 0; + + done = 1; +} + +#endif /* SYNTAX_TABLE */ +#undef P +#endif /* emacs */ + + +/* Sequents are missing isgraph. */ +#ifndef isgraph +#define isgraph(c) (isprint((c)) && !isspace((c))) +#endif + +/* Get the interface, including the syntax bits. */ +#include "regex.h" + + +/* These are the command codes that appear in compiled regular + expressions, one per byte. Some command codes are followed by + argument bytes. A command code can specify any interpretation + whatsoever for its arguments. Zero-bytes may appear in the compiled + regular expression. + + The value of `exactn' is needed in search.c (search_buffer) in emacs. + So regex.h defines a symbol `RE_EXACTN_VALUE' to be 1; the value of + `exactn' we use here must also be 1. */ + +enum regexpcode + { + unused=0, + exactn=1, /* Followed by one byte giving n, then by n literal bytes. */ + begline, /* Fail unless at beginning of line. */ + endline, /* Fail unless at end of line. */ + jump, /* Followed by two bytes giving relative address to jump to. */ + on_failure_jump, /* Followed by two bytes giving relative address of + place to resume at in case of failure. */ + finalize_jump, /* Throw away latest failure point and then jump to + address. 
*/ + maybe_finalize_jump, /* Like jump but finalize if safe to do so. + This is used to jump back to the beginning + of a repeat. If the command that follows + this jump is clearly incompatible with the + one at the beginning of the repeat, such that + we can be sure that there is no use backtracking + out of repetitions already completed, + then we finalize. */ + dummy_failure_jump, /* Jump, and push a dummy failure point. This + failure point will be thrown away if an attempt + is made to use it for a failure. A + construct + makes this before the first repeat. Also + use it as an intermediary kind of jump when + compiling an or construct. */ + succeed_n, /* Used like on_failure_jump except has to succeed n times; + then gets turned into an on_failure_jump. The relative + address following it is useless until then. The + address is followed by two bytes containing n. */ + jump_n, /* Similar to jump, but jump n times only; also the relative + address following is in turn followed by yet two more bytes + containing n. */ + set_number_at, /* Set the following relative location to the + subsequent number. */ + anychar, /* Matches any (more or less) one character. */ + charset, /* Matches any one char belonging to specified set. + First following byte is number of bitmap bytes. + Then come bytes for a bitmap saying which chars are in. + Bits in each byte are ordered low-bit-first. + A character is in the set if its bit is 1. + A character too large to have a bit in the map + is automatically not in the set. */ + charset_not, /* Same parameters as charset, but match any character + that is not one of those specified. */ + start_memory, /* Start remembering the text that is matched, for + storing in a memory register. Followed by one + byte containing the register number. Register numbers + must be in the range 0 through RE_NREGS. */ + stop_memory, /* Stop remembering the text that is matched + and store it in a memory register. 
Followed by + one byte containing the register number. Register + numbers must be in the range 0 through RE_NREGS. */ + duplicate, /* Match a duplicate of something remembered. + Followed by one byte containing the index of the memory + register. */ + before_dot, /* Succeeds if before point. */ + at_dot, /* Succeeds if at point. */ + after_dot, /* Succeeds if after point. */ + begbuf, /* Succeeds if at beginning of buffer. */ + endbuf, /* Succeeds if at end of buffer. */ + wordchar, /* Matches any word-constituent character. */ + notwordchar, /* Matches any char that is not a word-constituent. */ + wordbeg, /* Succeeds if at word beginning. */ + wordend, /* Succeeds if at word end. */ + wordbound, /* Succeeds if at a word boundary. */ + notwordbound,/* Succeeds if not at a word boundary. */ + syntaxspec, /* Matches any character whose syntax is specified. + followed by a byte which contains a syntax code, + e.g., Sword. */ + notsyntaxspec /* Matches any character whose syntax differs from + that specified. */ + }; + + +/* Number of failure points to allocate space for initially, + when matching. If this number is exceeded, more space is allocated, + so it is not a hard limit. */ + +#ifndef NFAILURES +#define NFAILURES 80 +#endif + +#ifdef CHAR_UNSIGNED +#define SIGN_EXTEND_CHAR(c) ((c)>(char)127?(c)-256:(c)) /* for IBM RT */ +#endif +#ifndef SIGN_EXTEND_CHAR +#define SIGN_EXTEND_CHAR(x) (x) +#endif + + +/* Store NUMBER in two contiguous bytes starting at DESTINATION. */ +#define STORE_NUMBER(destination, number) \ + { (destination)[0] = (number) & 0377; \ + (destination)[1] = (number) >> 8; } + +/* Same as STORE_NUMBER, except increment the destination pointer to + the byte after where the number is stored. Watch out that values for + DESTINATION such as p + 1 won't work, whereas p will. 
*/ +#define STORE_NUMBER_AND_INCR(destination, number) \ + { STORE_NUMBER(destination, number); \ + (destination) += 2; } + + +/* Put into DESTINATION a number stored in two contingous bytes starting + at SOURCE. */ +#define EXTRACT_NUMBER(destination, source) \ + { (destination) = *(source) & 0377; \ + (destination) += SIGN_EXTEND_CHAR (*(char *)((source) + 1)) << 8; } + +/* Same as EXTRACT_NUMBER, except increment the pointer for source to + point to second byte of SOURCE. Note that SOURCE has to be a value + such as p, not, e.g., p + 1. */ +#define EXTRACT_NUMBER_AND_INCR(destination, source) \ + { EXTRACT_NUMBER (destination, source); \ + (source) += 2; } + + +/* Specify the precise syntax of regexps for compilation. This provides + for compatibility for various utilities which historically have + different, incompatible syntaxes. + + The argument SYNTAX is a bit-mask comprised of the various bits + defined in regex.h. */ + +long +re_set_syntax (syntax) + long syntax; +{ + long ret; + + ret = obscure_syntax; + obscure_syntax = syntax; + return ret; +} + +/* Set by re_set_syntax to the current regexp syntax to recognize. */ +long obscure_syntax = 0; + + + +/* Macros for re_compile_pattern, which is found below these definitions. */ + +#define CHAR_CLASS_MAX_LENGTH 6 + +/* Fetch the next character in the uncompiled pattern, translating it if + necessary. */ +#define PATFETCH(c) \ + {if (p == pend) goto end_of_pattern; \ + c = * (unsigned char *) p++; \ + if (translate) c = translate[c]; } + +/* Fetch the next character in the uncompiled pattern, with no + translation. */ +#define PATFETCH_RAW(c) \ + {if (p == pend) goto end_of_pattern; \ + c = * (unsigned char *) p++; } + +#define PATUNFETCH p-- + + +/* If the buffer isn't allocated when it comes in, use this. */ +#define INIT_BUF_SIZE 28 + +/* Make sure we have at least N more bytes of space in buffer. 
*/ +#define GET_BUFFER_SPACE(n) \ + { \ + while (b - bufp->buffer + (n) >= bufp->allocated) \ + EXTEND_BUFFER; \ + } + +/* Make sure we have one more byte of buffer space and then add CH to it. */ +#define BUFPUSH(ch) \ + { \ + GET_BUFFER_SPACE (1); \ + *b++ = (char) (ch); \ + } + +/* Double the size of the buffer via reallocation (capped at 1 << 16) and + reset the pointers that pointed into the old allocation to point to + the correct places in the new allocation. If the buffer is already + 1 << 16 bytes, give up via too_big; if realloc fails, via memory_exhausted. */ +#define EXTEND_BUFFER \ + { char *old_buffer = bufp->buffer; \ + if (bufp->allocated == (1L<<16)) goto too_big; \ + bufp->allocated *= 2; \ + if (bufp->allocated > (1L<<16)) bufp->allocated = (1L<<16); \ + bufp->buffer = (char *) realloc (bufp->buffer, bufp->allocated); \ + if (bufp->buffer == 0) \ + goto memory_exhausted; \ + b = (b - old_buffer) + bufp->buffer; \ + if (fixup_jump) \ + fixup_jump = (fixup_jump - old_buffer) + bufp->buffer; \ + if (laststart) \ + laststart = (laststart - old_buffer) + bufp->buffer; \ + begalt = (begalt - old_buffer) + bufp->buffer; \ + if (pending_exact) \ + pending_exact = (pending_exact - old_buffer) + bufp->buffer; \ + } + +/* Set the bit for character C in a character set list. */ +#define SET_LIST_BIT(c) (b[(c) / BYTEWIDTH] |= 1 << ((c) % BYTEWIDTH)) + +/* Get the next unsigned number in the uncompiled pattern. */ +#define GET_UNSIGNED_NUMBER(num) \ + { if (p != pend) \ + { \ + PATFETCH (c); \ + while (isdigit (c)) \ + { \ + if (num < 0) \ + num = 0; \ + num = num * 10 + c - '0'; \ + if (p == pend) \ + break; \ + PATFETCH (c); \ + } \ + } \ + } + +/* Subroutines for re_compile_pattern. */ +/* static void store_jump (), insert_jump (), store_jump_n (), + insert_jump_n (), insert_op_2 (); */ + + +/* re_compile_pattern takes a regular-expression string + and converts it into a buffer full of byte commands for matching. 
+ + PATTERN is the address of the pattern string + SIZE is the length of it. + BUFP is a struct re_pattern_buffer * which points to the info + on where to store the byte commands. + This structure contains a char * which points to the + actual space, which should have been obtained with malloc. + re_compile_pattern may use realloc to grow the buffer space. + + The number of bytes of commands can be found out by looking in + the `struct re_pattern_buffer' that bufp pointed to, after + re_compile_pattern returns. */ + +char * +re_compile_pattern (pattern, size, bufp) + char *pattern; + size_t size; + struct re_pattern_buffer *bufp; +{ + register char *b = bufp->buffer; + register char *p = pattern; + char *pend = pattern + size; + register unsigned c, c1; + char *p0; + unsigned char *translate = (unsigned char *) bufp->translate; + + /* Address of the count-byte of the most recently inserted `exactn' + command. This makes it possible to tell whether a new exact-match + character can be added to that command or requires a new `exactn' + command. */ + + char *pending_exact = 0; + + /* Address of the place where a forward-jump should go to the end of + the containing expression. Each alternative of an `or', except the + last, ends with a forward-jump of this sort. */ + + char *fixup_jump = 0; + + /* Address of start of the most recently finished expression. + This tells postfix * where to find the start of its operand. */ + + char *laststart = 0; + + /* In processing a repeat, 1 means zero matches is allowed. */ + + char zero_times_ok; + + /* In processing a repeat, 1 means many matches is allowed. */ + + char many_times_ok; + + /* Address of beginning of regexp, or inside of last \(. */ + + char *begalt = b; + + /* In processing an interval, at least this many matches must be made. */ + int lower_bound; + + /* In processing an interval, at most this many matches can be made. 
*/ + int upper_bound; + + /* Place in pattern (i.e., the {) to which to go back if the interval + is invalid. */ + char *beg_interval = 0; + + /* Stack of information saved by \( and restored by \). + Four stack elements are pushed by each \(: + First, the value of b. + Second, the value of fixup_jump. + Third, the value of regnum. + Fourth, the value of begalt. */ + + int stackb[40]; + int *stackp = stackb; + int *stacke = stackb + 40; + int *stackt; + + /* Counts \('s as they are encountered. Remembered for the matching \), + where it becomes the register number to put in the stop_memory + command. */ + + int regnum = 1; + + bufp->fastmap_accurate = 0; + +#ifndef emacs +#ifndef SYNTAX_TABLE + /* Initialize the syntax table. */ + init_syntax_once(); +#endif +#endif + + if (bufp->allocated == 0) + { + bufp->allocated = INIT_BUF_SIZE; + if (bufp->buffer) + /* EXTEND_BUFFER loses when bufp->allocated is 0. */ + bufp->buffer = (char *) realloc (bufp->buffer, INIT_BUF_SIZE); + else + /* Caller did not allocate a buffer. Do it for them. */ + bufp->buffer = (char *) malloc (INIT_BUF_SIZE); + if (!bufp->buffer) goto memory_exhausted; + begalt = b = bufp->buffer; + } + + while (p != pend) + { + PATFETCH (c); + + switch (c) + { + case '$': + { + char *p1 = p; + /* When testing what follows the $, + look past the \-constructs that don't consume anything. */ + if (! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + while (p1 != pend) + { + if (*p1 == '\\' && p1 + 1 != pend + && (p1[1] == '<' || p1[1] == '>' + || p1[1] == '`' || p1[1] == '\'' +#ifdef emacs + || p1[1] == '=' +#endif + || p1[1] == 'b' || p1[1] == 'B')) + p1 += 2; + else + break; + } + if (obscure_syntax & RE_TIGHT_VBAR) + { + if (! (obscure_syntax & RE_CONTEXT_INDEP_OPS) && p1 != pend) + goto normal_char; + /* Make operand of last vbar end before this `$'. 
*/ + if (fixup_jump) + store_jump (fixup_jump, jump, b); + fixup_jump = 0; + BUFPUSH (endline); + break; + } + /* $ means succeed if at end of line, but only in special contexts. + If validly in the middle of a pattern, it is a normal character. */ + + if ((obscure_syntax & RE_CONTEXTUAL_INVALID_OPS) && p1 != pend) + goto invalid_pattern; + if (p1 == pend || *p1 == '\n' + || (obscure_syntax & RE_CONTEXT_INDEP_OPS) + || (obscure_syntax & RE_NO_BK_PARENS + ? *p1 == ')' + : *p1 == '\\' && p1[1] == ')') + || (obscure_syntax & RE_NO_BK_VBAR + ? *p1 == '|' + : *p1 == '\\' && p1[1] == '|')) + { + BUFPUSH (endline); + break; + } + goto normal_char; + } + case '^': + /* ^ means succeed if at beg of line, but only if no preceding + pattern. */ + + if ((obscure_syntax & RE_CONTEXTUAL_INVALID_OPS) && laststart) + goto invalid_pattern; + if (laststart && p - 2 >= pattern && p[-2] != '\n' + && !(obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + if (obscure_syntax & RE_TIGHT_VBAR) + { + if (p != pattern + 1 + && ! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + BUFPUSH (begline); + begalt = b; + } + else + BUFPUSH (begline); + break; + + case '+': + case '?': + if ((obscure_syntax & RE_BK_PLUS_QM) + || (obscure_syntax & RE_LIMITED_OPS)) + goto normal_char; + handle_plus: + case '*': + /* If there is no previous pattern, char not special. */ + if (!laststart) + { + if (obscure_syntax & RE_CONTEXTUAL_INVALID_OPS) + goto invalid_pattern; + else if (! (obscure_syntax & RE_CONTEXT_INDEP_OPS)) + goto normal_char; + } + /* If there is a sequence of repetition chars, + collapse it down to just one. 
*/ + zero_times_ok = 0; + many_times_ok = 0; + while (1) + { + zero_times_ok |= c != '+'; + many_times_ok |= c != '?'; + if (p == pend) + break; + PATFETCH (c); + if (c == '*') + ; + else if (!(obscure_syntax & RE_BK_PLUS_QM) + && (c == '+' || c == '?')) + ; + else if ((obscure_syntax & RE_BK_PLUS_QM) + && c == '\\') + { + /* int c1; */ + PATFETCH (c1); + if (!(c1 == '+' || c1 == '?')) + { + PATUNFETCH; + PATUNFETCH; + break; + } + c = c1; + } + else + { + PATUNFETCH; + break; + } + } + + /* Star, etc. applied to an empty pattern is equivalent + to an empty pattern. */ + if (!laststart) + break; + + /* Now we know whether or not zero matches is allowed + and also whether or not two or more matches is allowed. */ + if (many_times_ok) + { + /* If more than one repetition is allowed, put in at the + end a backward relative jump from b to before the next + jump we're going to put in below (which jumps from + laststart to after this jump). */ + GET_BUFFER_SPACE (3); + store_jump (b, maybe_finalize_jump, laststart - 3); + b += 3; /* Because store_jump put stuff here. */ + } + /* On failure, jump from laststart to b + 3, which will be the + end of the buffer after this jump is inserted. */ + GET_BUFFER_SPACE (3); + insert_jump (on_failure_jump, laststart, b + 3, b); + pending_exact = 0; + b += 3; + if (!zero_times_ok) + { + /* At least one repetition is required, so insert a + dummy-failure before the initial on-failure-jump + instruction of the loop. This effects a skip over that + instruction the first time we hit that loop. 
*/ + GET_BUFFER_SPACE (6); + insert_jump (dummy_failure_jump, laststart, laststart + 6, b); + b += 3; + } + break; + + case '.': + laststart = b; + BUFPUSH (anychar); + break; + + case '[': + if (p == pend) + goto invalid_pattern; + while (b - bufp->buffer + > bufp->allocated - 3 - (1 << BYTEWIDTH) / BYTEWIDTH) + EXTEND_BUFFER; + + laststart = b; + if (*p == '^') + { + BUFPUSH (charset_not); + p++; + } + else + BUFPUSH (charset); + p0 = p; + + BUFPUSH ((1 << BYTEWIDTH) / BYTEWIDTH); + /* Clear the whole map */ + memset (b, 0, (1 << BYTEWIDTH) / BYTEWIDTH); + + if ((obscure_syntax & RE_HAT_NOT_NEWLINE) && b[-2] == charset_not) + SET_LIST_BIT ('\n'); + + + /* Read in characters and ranges, setting map bits. */ + while (1) + { + /* Don't translate while fetching, in case it's a range bound. + When we set the bit for the character, we translate it. */ + PATFETCH_RAW (c); + + /* If set, \ escapes characters when inside [...]. */ + if ((obscure_syntax & RE_AWK_CLASS_HACK) && c == '\\') + { + PATFETCH(c1); + SET_LIST_BIT (c1); + continue; + } + if (c == ']') + { + if (p == p0 + 1) + { + /* If this is an empty bracket expression. */ + if ((obscure_syntax & RE_NO_EMPTY_BRACKETS) + && p == pend) + goto invalid_pattern; + } + else + /* Stop if this isn't merely a ] inside a bracket + expression, but rather the end of a bracket + expression. */ + break; + } + /* Get a range. */ + if (p[0] == '-' && p[1] != ']') + { + PATFETCH (c1); + /* Don't translate the range bounds while fetching them. */ + PATFETCH_RAW (c1); + + if ((obscure_syntax & RE_NO_EMPTY_RANGES) && c > c1) + goto invalid_pattern; + + if ((obscure_syntax & RE_NO_HYPHEN_RANGE_END) + && c1 == '-' && *p != ']') + goto invalid_pattern; + + while (c <= c1) + { + /* Translate each char that's in the range. 
*/ + if (translate) + SET_LIST_BIT (translate[c]); + else + SET_LIST_BIT (c); + c++; + } + } + else if ((obscure_syntax & RE_CHAR_CLASSES) + && c == '[' && p[0] == ':') + { + /* Longest valid character class word has six characters. */ + char str[CHAR_CLASS_MAX_LENGTH]; + PATFETCH (c); + c1 = 0; + /* If no ] at end. */ + if (p == pend) + goto invalid_pattern; + while (1) + { + /* Don't translate the ``character class'' characters. */ + PATFETCH_RAW (c); + if (c == ':' || c == ']' || p == pend + || c1 == CHAR_CLASS_MAX_LENGTH) + break; + str[c1++] = c; + } + str[c1] = '\0'; + if (p == pend + || c == ']' /* End of the bracket expression. */ + || p[0] != ']' + || p + 1 == pend + || (strcmp (str, "alpha") != 0 + && strcmp (str, "upper") != 0 + && strcmp (str, "lower") != 0 + && strcmp (str, "digit") != 0 + && strcmp (str, "alnum") != 0 + && strcmp (str, "xdigit") != 0 + && strcmp (str, "space") != 0 + && strcmp (str, "print") != 0 + && strcmp (str, "punct") != 0 + && strcmp (str, "graph") != 0 + && strcmp (str, "cntrl") != 0)) + { + /* Undo the ending character, the letters, and leave + the leading : and [ (but set bits for them). */ + c1++; + while (c1--) + PATUNFETCH; + SET_LIST_BIT ('['); + SET_LIST_BIT (':'); + } + else + { + /* The ] at the end of the character class. 
*/ + PATFETCH (c); + if (c != ']') + goto invalid_pattern; + for (c = 0; c < (1 << BYTEWIDTH); c++) + { + if ((strcmp (str, "alpha") == 0 && isalpha (c)) + || (strcmp (str, "upper") == 0 && isupper (c)) + || (strcmp (str, "lower") == 0 && islower (c)) + || (strcmp (str, "digit") == 0 && isdigit (c)) + || (strcmp (str, "alnum") == 0 && isalnum (c)) + || (strcmp (str, "xdigit") == 0 && isxdigit (c)) + || (strcmp (str, "space") == 0 && isspace (c)) + || (strcmp (str, "print") == 0 && isprint (c)) + || (strcmp (str, "punct") == 0 && ispunct (c)) + || (strcmp (str, "graph") == 0 && isgraph (c)) + || (strcmp (str, "cntrl") == 0 && iscntrl (c))) + SET_LIST_BIT (c); + } + } + } + else if (translate) + SET_LIST_BIT (translate[c]); + else + SET_LIST_BIT (c); + } + + /* Discard any character set/class bitmap bytes that are all + 0 at the end of the map. Decrement the map-length byte too. */ + while ((int) b[-1] > 0 && b[b[-1] - 1] == 0) + b[-1]--; + b += b[-1]; + break; + + case '(': + if (! (obscure_syntax & RE_NO_BK_PARENS)) + goto normal_char; + else + goto handle_open; + + case ')': + if (! (obscure_syntax & RE_NO_BK_PARENS)) + goto normal_char; + else + goto handle_close; + + case '\n': + if (! (obscure_syntax & RE_NEWLINE_OR)) + goto normal_char; + else + goto handle_bar; + + case '|': + if ((obscure_syntax & RE_CONTEXTUAL_INVALID_OPS) + && (! laststart || p == pend)) + goto invalid_pattern; + else if (! (obscure_syntax & RE_NO_BK_VBAR)) + goto normal_char; + else + goto handle_bar; + + case '{': + if (! 
((obscure_syntax & RE_NO_BK_CURLY_BRACES) + && (obscure_syntax & RE_INTERVALS))) + goto normal_char; + else + goto handle_interval; + + case '\\': + if (p == pend) goto invalid_pattern; + PATFETCH_RAW (c); + switch (c) + { + case '(': + if (obscure_syntax & RE_NO_BK_PARENS) + goto normal_backsl; + handle_open: + if (stackp == stacke) goto nesting_too_deep; + + /* Laststart should point to the start_memory that we are about + to push (unless the pattern has RE_NREGS or more ('s). */ + *stackp++ = b - bufp->buffer; + if (regnum < RE_NREGS) + { + BUFPUSH (start_memory); + BUFPUSH (regnum); + } + *stackp++ = fixup_jump ? fixup_jump - bufp->buffer + 1 : 0; + *stackp++ = regnum++; + *stackp++ = begalt - bufp->buffer; + fixup_jump = 0; + laststart = 0; + begalt = b; + break; + + case ')': + if (obscure_syntax & RE_NO_BK_PARENS) + goto normal_backsl; + handle_close: + if (stackp == stackb) goto unmatched_close; + begalt = *--stackp + bufp->buffer; + if (fixup_jump) + store_jump (fixup_jump, jump, b); + if (stackp[-1] < RE_NREGS) + { + BUFPUSH (stop_memory); + BUFPUSH (stackp[-1]); + } + stackp -= 2; + fixup_jump = *stackp ? *stackp + bufp->buffer - 1 : 0; + laststart = *--stackp + bufp->buffer; + break; + + case '|': + if ((obscure_syntax & RE_LIMITED_OPS) + || (obscure_syntax & RE_NO_BK_VBAR)) + goto normal_backsl; + handle_bar: + if (obscure_syntax & RE_LIMITED_OPS) + goto normal_char; + /* Insert before the previous alternative a jump which + jumps to this alternative if the former fails. */ + GET_BUFFER_SPACE (6); + insert_jump (on_failure_jump, begalt, b + 6, b); + pending_exact = 0; + b += 3; + /* The alternative before the previous alternative has a + jump after it which gets executed if it gets matched. + Adjust that jump so it will jump to the previous + alternative's analogous jump (put in below, which in + turn will jump to the next (if any) alternative's such + jump, etc.). The last such jump jumps to the correct + final destination. 
*/ + if (fixup_jump) + store_jump (fixup_jump, jump, b); + + /* Leave space for a jump after previous alternative---to be + filled in later. */ + fixup_jump = b; + b += 3; + + laststart = 0; + begalt = b; + break; + + case '{': + if (! (obscure_syntax & RE_INTERVALS) + /* Let \{ be a literal. */ + || ((obscure_syntax & RE_INTERVALS) + && (obscure_syntax & RE_NO_BK_CURLY_BRACES)) + /* If it's the string "\{". */ + || (p - 2 == pattern && p == pend)) + goto normal_backsl; + handle_interval: + beg_interval = p - 1; /* The {. */ + /* If there is no previous pattern, this isn't an interval. */ + if (!laststart) + { + if (obscure_syntax & RE_CONTEXTUAL_INVALID_OPS) + goto invalid_pattern; + else + goto normal_backsl; + } + /* It also isn't an interval if not preceded by an re + matching a single character or subexpression, or if + the current type of intervals can't handle back + references and the previous thing is a back reference. */ + if (! (*laststart == anychar + || *laststart == charset + || *laststart == charset_not + || *laststart == start_memory + || (*laststart == exactn && laststart[1] == 1) + || (! (obscure_syntax & RE_NO_BK_REFS) + && *laststart == duplicate))) + { + if (obscure_syntax & RE_NO_BK_CURLY_BRACES) + goto normal_char; + + /* Posix extended syntax is handled in previous + statement; this is for Posix basic syntax. */ + if (obscure_syntax & RE_INTERVALS) + goto invalid_pattern; + + goto normal_backsl; + } + lower_bound = -1; /* So can see if are set. */ + upper_bound = -1; + GET_UNSIGNED_NUMBER (lower_bound); + if (c == ',') + { + GET_UNSIGNED_NUMBER (upper_bound); + if (upper_bound < 0) + upper_bound = RE_DUP_MAX; + } + if (upper_bound < 0) + upper_bound = lower_bound; + if (! 
(obscure_syntax & RE_NO_BK_CURLY_BRACES)) + { + if (c != '\\') + goto invalid_pattern; + PATFETCH (c); + } + if (c != '}' || lower_bound < 0 || upper_bound > RE_DUP_MAX + || lower_bound > upper_bound + || ((obscure_syntax & RE_NO_BK_CURLY_BRACES) + && p != pend && *p == '{')) + { + if (obscure_syntax & RE_NO_BK_CURLY_BRACES) + goto unfetch_interval; + else + goto invalid_pattern; + } + + /* If upper_bound is zero, don't want to succeed at all; + jump from laststart to b + 3, which will be the end of + the buffer after this jump is inserted. */ + + if (upper_bound == 0) + { + GET_BUFFER_SPACE (3); + insert_jump (jump, laststart, b + 3, b); + b += 3; + } + + /* Otherwise, after lower_bound number of succeeds, jump + to after the jump_n which will be inserted at the end + of the buffer, and insert that jump_n. */ + else + { /* Set to 5 if only one repetition is allowed and + hence no jump_n is inserted at the current end of + the buffer; then only space for the succeed_n is + needed. Otherwise, need space for both the + succeed_n and the jump_n. */ + + unsigned slots_needed = upper_bound == 1 ? 5 : 10; + + GET_BUFFER_SPACE (slots_needed); + /* Initialize the succeed_n to n, even though it will + be set by its attendant set_number_at, because + re_compile_fastmap will need to know it. Jump to + what the end of buffer will be after inserting + this succeed_n and possibly appending a jump_n. */ + insert_jump_n (succeed_n, laststart, b + slots_needed, + b, lower_bound); + b += 5; /* Just increment for the succeed_n here. */ + + /* More than one repetition is allowed, so put in at + the end of the buffer a backward jump from b to the + succeed_n we put in above. By the time we've gotten + to this jump when matching, we'll have matched once + already, so jump back only upper_bound - 1 times. */ + + if (upper_bound > 1) + { + store_jump_n (b, jump_n, laststart, upper_bound - 1); + b += 5; + /* When hit this when matching, reset the + preceding jump_n's n to upper_bound - 1. 
*/ + BUFPUSH (set_number_at); + GET_BUFFER_SPACE (2); + STORE_NUMBER_AND_INCR (b, -5); + STORE_NUMBER_AND_INCR (b, upper_bound - 1); + } + /* When hit this when matching, set the succeed_n's n. */ + GET_BUFFER_SPACE (5); + insert_op_2 (set_number_at, laststart, b, 5, lower_bound); + b += 5; + } + pending_exact = 0; + beg_interval = 0; + break; + + + unfetch_interval: + /* If an invalid interval, match the characters as literals. */ + if (beg_interval) + p = beg_interval; + else + { + fprintf (stderr, + "regex: no interval beginning to which to backtrack.\n"); + exit (1); + } + + beg_interval = 0; + PATFETCH (c); /* normal_char expects char in `c'. */ + goto normal_char; + break; + +#ifdef emacs + case '=': + BUFPUSH (at_dot); + break; + + case 's': + laststart = b; + BUFPUSH (syntaxspec); + PATFETCH (c); + BUFPUSH (syntax_spec_code[c]); + break; + + case 'S': + laststart = b; + BUFPUSH (notsyntaxspec); + PATFETCH (c); + BUFPUSH (syntax_spec_code[c]); + break; +#endif /* emacs */ + + case 'w': + laststart = b; + BUFPUSH (wordchar); + break; + + case 'W': + laststart = b; + BUFPUSH (notwordchar); + break; + + case '<': + BUFPUSH (wordbeg); + break; + + case '>': + BUFPUSH (wordend); + break; + + case 'b': + BUFPUSH (wordbound); + break; + + case 'B': + BUFPUSH (notwordbound); + break; + + case '`': + BUFPUSH (begbuf); + break; + + case '\'': + BUFPUSH (endbuf); + break; + + case '1': + case '2': + case '3': + case '4': + case '5': + case '6': + case '7': + case '8': + case '9': + if (obscure_syntax & RE_NO_BK_REFS) + goto normal_char; + c1 = c - '0'; + if (c1 >= regnum) + { + if (obscure_syntax & RE_NO_EMPTY_BK_REF) + goto invalid_pattern; + else + goto normal_char; + } + /* Can't back reference to a subexpression if inside of it. 
*/ + for (stackt = stackp - 2; stackt > stackb; stackt -= 4) + if (*stackt == c1) + goto normal_char; + laststart = b; + BUFPUSH (duplicate); + BUFPUSH (c1); + break; + + case '+': + case '?': + if (obscure_syntax & RE_BK_PLUS_QM) + goto handle_plus; + else + goto normal_backsl; + break; + + default: + normal_backsl: + /* You might think it would be useful for \ to mean + not to translate; but if we don't translate it + it will never match anything. */ + if (translate) c = translate[c]; + goto normal_char; + } + break; + + default: + normal_char: /* Expects the character in `c'. */ + if (!pending_exact || pending_exact + *pending_exact + 1 != b + || *pending_exact == 0177 || *p == '*' || *p == '^' + || ((obscure_syntax & RE_BK_PLUS_QM) + ? *p == '\\' && (p[1] == '+' || p[1] == '?') + : (*p == '+' || *p == '?')) + || ((obscure_syntax & RE_INTERVALS) + && ((obscure_syntax & RE_NO_BK_CURLY_BRACES) + ? *p == '{' + : (p[0] == '\\' && p[1] == '{')))) + { + laststart = b; + BUFPUSH (exactn); + pending_exact = b; + BUFPUSH (0); + } + BUFPUSH (c); + (*pending_exact)++; + } + } + + if (fixup_jump) + store_jump (fixup_jump, jump, b); + + if (stackp != stackb) goto unmatched_open; + + bufp->used = b - bufp->buffer; + return 0; + + invalid_pattern: + return "Invalid regular expression"; + + unmatched_open: + return "Unmatched \\("; + + unmatched_close: + return "Unmatched \\)"; + + end_of_pattern: + return "Premature end of regular expression"; + + nesting_too_deep: + return "Nesting too deep"; + + too_big: + return "Regular expression too big"; + + memory_exhausted: + return "Memory exhausted"; +} + + +/* Store a jump of the form <OPCODE> <relative address>. + Store in the location FROM a jump operation to jump to relative + address FROM - TO. OPCODE is the opcode to store. 
*/ + +static void +store_jump (from, opcode, to) + char *from, *to; + int opcode; +{ + from[0] = (char)opcode; + STORE_NUMBER(from + 1, to - (from + 3)); +} + + +/* Open up space before char FROM, and insert there a jump to TO. + CURRENT_END gives the end of the storage in use, so we know + how much data to copy up. OP is the opcode of the jump to insert. + + If you call this function, you must zero out pending_exact. */ + +static void +insert_jump (op, from, to, current_end) + int op; + char *from, *to, *current_end; +{ + register char *pfrom = current_end; /* Copy from here... */ + register char *pto = current_end + 3; /* ...to here. */ + + while (pfrom != from) + *--pto = *--pfrom; + store_jump (from, op, to); +} + + +/* Store a jump of the form <opcode> <relative address> <n> . + + Store in the location FROM a jump operation to jump to relative + address TO - (FROM + 3). OPCODE is the opcode to store, N is a number the + jump uses, say, to decide how many times to jump. + + If you call this function, you must zero out pending_exact. */ + +static void +store_jump_n (from, opcode, to, n) + char *from, *to; + int opcode; + unsigned n; +{ + from[0] = (char)opcode; + STORE_NUMBER (from + 1, to - (from + 3)); + STORE_NUMBER (from + 3, n); +} + + +/* Similar to insert_jump, but handles a jump which needs an extra + number to handle minimum and maximum cases. Open up space at + location FROM, and insert there a jump to TO. CURRENT_END gives the + end of the storage in use, so we know how much data to copy up. OP is + the opcode of the jump to insert. + + If you call this function, you must zero out pending_exact. */ + +static void +insert_jump_n (op, from, to, current_end, n) + int op; + char *from, *to, *current_end; + unsigned n; +{ + register char *pfrom = current_end; /* Copy from here... */ + register char *pto = current_end + 5; /* ...to here. 
*/ + + while (pfrom != from) + *--pto = *--pfrom; + store_jump_n (from, op, to, n); +} + + +/* Open up space at location THERE, and insert operation OP followed by + NUM_1 and NUM_2. CURRENT_END gives the end of the storage in use, so + we know how much data to copy up. + + If you call this function, you must zero out pending_exact. */ + +static void +insert_op_2 (op, there, current_end, num_1, num_2) + int op; + char *there, *current_end; + int num_1, num_2; +{ + register char *pfrom = current_end; /* Copy from here... */ + register char *pto = current_end + 5; /* ...to here. */ + + while (pfrom != there) + *--pto = *--pfrom; + + there[0] = (char)op; + STORE_NUMBER (there + 1, num_1); + STORE_NUMBER (there + 3, num_2); +} + + + +/* Given a pattern, compute a fastmap from it. The fastmap records + which of the (1 << BYTEWIDTH) possible characters can start a string + that matches the pattern. This fastmap is used by re_search to skip + quickly over totally implausible text. + + The caller must supply the address of a (1 << BYTEWIDTH)-byte data + area as bufp->fastmap. + The other components of bufp describe the pattern to be used. 
*/ + +void +re_compile_fastmap (bufp) + struct re_pattern_buffer *bufp; +{ + unsigned char *pattern = (unsigned char *) bufp->buffer; + int size = bufp->used; + register char *fastmap = bufp->fastmap; + register unsigned char *p = pattern; + register unsigned char *pend = pattern + size; + register int j, k; + unsigned char *translate = (unsigned char *) bufp->translate; + unsigned is_a_succeed_n; + +#ifndef NO_ALLOCA + unsigned char *stackb[NFAILURES]; + unsigned char **stackp = stackb; + +#else + unsigned char **stackb; + unsigned char **stackp; + stackb = (unsigned char **) malloc (NFAILURES * sizeof (unsigned char *)); + stackp = stackb; + +#endif /* NO_ALLOCA */ + memset (fastmap, 0, (1 << BYTEWIDTH)); + bufp->fastmap_accurate = 1; + bufp->can_be_null = 0; + + while (p) + { + is_a_succeed_n = 0; + if (p == pend) + { + bufp->can_be_null = 1; + break; + } +#ifdef SWITCH_ENUM_BUG + switch ((int) ((enum regexpcode) *p++)) +#else + switch ((enum regexpcode) *p++) +#endif + { + case exactn: + if (translate) + fastmap[translate[p[1]]] = 1; + else + fastmap[p[1]] = 1; + break; + + case begline: + case before_dot: + case at_dot: + case after_dot: + case begbuf: + case endbuf: + case wordbound: + case notwordbound: + case wordbeg: + case wordend: + continue; + + case endline: + if (translate) + fastmap[translate['\n']] = 1; + else + fastmap['\n'] = 1; + + if (bufp->can_be_null != 1) + bufp->can_be_null = 2; + break; + + case jump_n: + case finalize_jump: + case maybe_finalize_jump: + case jump: + case dummy_failure_jump: + EXTRACT_NUMBER_AND_INCR (j, p); + p += j; + if (j > 0) + continue; + /* Jump backward reached implies we just went through + the body of a loop and matched nothing. + Opcode jumped to should be an on_failure_jump. + Just treat it like an ordinary jump. + For a * loop, it has pushed its failure point already; + If so, discard that as redundant. 
*/ + + if ((enum regexpcode) *p != on_failure_jump + && (enum regexpcode) *p != succeed_n) + continue; + p++; + EXTRACT_NUMBER_AND_INCR (j, p); + p += j; + if (stackp != stackb && *stackp == p) + stackp--; + continue; + + case on_failure_jump: + handle_on_failure_jump: + EXTRACT_NUMBER_AND_INCR (j, p); + *++stackp = p + j; + if (is_a_succeed_n) + EXTRACT_NUMBER_AND_INCR (k, p); /* Skip the n. */ + continue; + + case succeed_n: + is_a_succeed_n = 1; + /* Get to the number of times to succeed. */ + p += 2; + /* Increment p past the n for when k != 0. */ + EXTRACT_NUMBER_AND_INCR (k, p); + if (k == 0) + { + p -= 4; + goto handle_on_failure_jump; + } + continue; + + case set_number_at: + p += 4; + continue; + + case start_memory: + case stop_memory: + p++; + continue; + + case duplicate: + bufp->can_be_null = 1; + fastmap['\n'] = 1; + case anychar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (j != '\n') + fastmap[j] = 1; + if (bufp->can_be_null) + { + FREE_AND_RETURN_VOID(stackb); + } + /* Don't return; check the alternative paths + so we can set can_be_null if appropriate. 
*/ + break; + + case wordchar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) == Sword) + fastmap[j] = 1; + break; + + case notwordchar: + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) != Sword) + fastmap[j] = 1; + break; + +#ifdef emacs + case syntaxspec: + k = *p++; + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) == (enum syntaxcode) k) + fastmap[j] = 1; + break; + + case notsyntaxspec: + k = *p++; + for (j = 0; j < (1 << BYTEWIDTH); j++) + if (SYNTAX (j) != (enum syntaxcode) k) + fastmap[j] = 1; + break; + +#else /* not emacs */ + case syntaxspec: + case notsyntaxspec: + break; +#endif /* not emacs */ + + case charset: + for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--) + if (p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH))) + { + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + } + break; + + case charset_not: + /* Chars beyond end of map must be allowed */ + for (j = *p * BYTEWIDTH; j < (1 << BYTEWIDTH); j++) + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + + for (j = *p++ * BYTEWIDTH - 1; j >= 0; j--) + if (!(p[j / BYTEWIDTH] & (1 << (j % BYTEWIDTH)))) + { + if (translate) + fastmap[translate[j]] = 1; + else + fastmap[j] = 1; + } + break; + + case unused: /* pacify gcc -Wall */ + break; + } + + /* Get here means we have successfully found the possible starting + characters of one path of the pattern. We need not follow this + path any farther. Instead, look at the next alternative + remembered in the stack. */ + if (stackp != stackb) + p = *stackp--; + else + break; + } + FREE_AND_RETURN_VOID(stackb); +} + + + +/* Like re_search_2, below, but only one string is specified, and + doesn't let you say where to stop matching. 
*/ + +int +re_search (pbufp, string, size, startpos, range, regs) + struct re_pattern_buffer *pbufp; + char *string; + int size, startpos, range; + struct re_registers *regs; +{ + return re_search_2 (pbufp, (char *) 0, 0, string, size, startpos, range, + regs, size); +} + + +/* Using the compiled pattern in PBUFP->buffer, first tries to match the + virtual concatenation of STRING1 and STRING2, starting first at index + STARTPOS, then at STARTPOS + 1, and so on. RANGE is the number of + places to try before giving up. If RANGE is negative, it searches + backwards, i.e., the starting positions tried are STARTPOS, STARTPOS + - 1, etc. STRING1 and STRING2 are of SIZE1 and SIZE2, respectively. + In REGS, return the indices of the virtual concatenation of STRING1 + and STRING2 that matched the entire PBUFP->buffer and its contained + subexpressions. Do not consider matching one past the index MSTOP in + the virtual concatenation of STRING1 and STRING2. + + The value returned is the position in the strings at which the match + was found, or -1 if no match was found, or -2 if error (such as + failure stack overflow). */ + +int +re_search_2 (pbufp, string1, size1, string2, size2, startpos, range, + regs, mstop) + struct re_pattern_buffer *pbufp; + char *string1, *string2; + int size1, size2; + int startpos; + register int range; + struct re_registers *regs; + int mstop; +{ + register char *fastmap = pbufp->fastmap; + register unsigned char *translate = (unsigned char *) pbufp->translate; + int total_size = size1 + size2; + int endpos = startpos + range; + int val; + + /* Check for out-of-range starting position. */ + if (startpos < 0 || startpos > total_size) + return -1; + + /* Fix up range if it would eventually take startpos outside of the + virtual concatenation of string1 and string2. */ + if (endpos < -1) + range = -1 - startpos; + else if (endpos > total_size) + range = total_size - startpos; + + /* Update the fastmap now if not correct already. 
*/ + if (fastmap && !pbufp->fastmap_accurate) + re_compile_fastmap (pbufp); + + /* If the search isn't to be a backwards one, don't waste time in a + long search for a pattern that says it is anchored. */ + if (pbufp->used > 0 && (enum regexpcode) pbufp->buffer[0] == begbuf + && range > 0) + { + if (startpos > 0) + return -1; + else + range = 1; + } + + while (1) + { + /* If a fastmap is supplied, skip quickly over characters that + cannot possibly be the start of a match. Note, however, that + if the pattern can possibly match the null string, we must + test it at each starting point so that we take the first null + string we get. */ + + if (fastmap && startpos < total_size && pbufp->can_be_null != 1) + { + if (range > 0) /* Searching forwards. */ + { + register int lim = 0; + register unsigned char *p; + int irange = range; + if (startpos < size1 && startpos + range >= size1) + lim = range - (size1 - startpos); + + p = ((unsigned char *) + &(startpos >= size1 ? string2 - size1 : string1)[startpos]); + + while (range > lim && !fastmap[translate + ? translate[*p++] + : *p++]) + range--; + startpos += irange - range; + } + else /* Searching backwards. */ + { + register unsigned char c; + + if (string1 == 0 || startpos >= size1) + c = string2[startpos - size1]; + else + c = string1[startpos]; + + c &= 0xff; + if (translate ? !fastmap[translate[c]] : !fastmap[c]) + goto advance; + } + } + + if (range >= 0 && startpos == total_size + && fastmap && pbufp->can_be_null == 0) + return -1; + + val = re_match_2 (pbufp, string1, size1, string2, size2, startpos, + regs, mstop); + if (val >= 0) + return startpos; + if (val == -2) + return -2; + +#ifndef NO_ALLOCA +#ifdef C_ALLOCA + alloca (0); +#endif /* C_ALLOCA */ + +#endif /* NO_ALLOCA */ + advance: + if (!range) + break; + else if (range > 0) + { + range--; + startpos++; + } + else + { + range++; + startpos--; + } + } + return -1; +} + + + +#ifndef emacs /* emacs never uses this. 
*/ +int +re_match (pbufp, string, size, pos, regs) + struct re_pattern_buffer *pbufp; + char *string; + int size, pos; + struct re_registers *regs; +{ + return re_match_2 (pbufp, (char *) 0, 0, string, size, pos, regs, size); +} +#endif /* not emacs */ + + +/* The following are used for re_match_2, defined below: */ + +/* Roughly the maximum number of failure points on the stack. Would be + exactly that if always pushed MAX_NUM_FAILURE_ITEMS each time we failed. */ + +int re_max_failures = 2000; + +/* Routine used by re_match_2. */ +/* static int memcmp_translate (); *//* already declared */ + + +/* Structure and accessing macros used in re_match_2: */ + +struct register_info +{ + unsigned is_active : 1; + unsigned matched_something : 1; +}; + +#define IS_ACTIVE(R) ((R).is_active) +#define MATCHED_SOMETHING(R) ((R).matched_something) + + +/* Macros used by re_match_2: */ + + +/* I.e., regstart, regend, and reg_info. */ + +#define NUM_REG_ITEMS 3 + +/* We push at most this many things on the stack whenever we + fail. The `+ 2' refers to PATTERN_PLACE and STRING_PLACE, which are + arguments to the PUSH_FAILURE_POINT macro. */ + +#define MAX_NUM_FAILURE_ITEMS (RE_NREGS * NUM_REG_ITEMS + 2) + + +/* We push this many things on the stack whenever we fail. */ + +#define NUM_FAILURE_ITEMS (last_used_reg * NUM_REG_ITEMS + 2) + + +/* This pushes most of the information about the current state we will want + if we ever fail back to it. */ + +#define PUSH_FAILURE_POINT(pattern_place, string_place) \ + { \ + long last_used_reg, this_reg; \ + \ + /* Find out how many registers are active or have been matched. \ + (Aside from register zero, which is only set at the end.) 
*/ \ + for (last_used_reg = RE_NREGS - 1; last_used_reg > 0; last_used_reg--)\ + if (regstart[last_used_reg] != (unsigned char *)(-1L)) \ + break; \ + \ + if (stacke - stackp < NUM_FAILURE_ITEMS) \ + { \ + unsigned char **stackx; \ + unsigned int len = stacke - stackb; \ + if (len > re_max_failures * MAX_NUM_FAILURE_ITEMS) \ + { \ + FREE_AND_RETURN(stackb,(-2)); \ + } \ + \ + /* Roughly double the size of the stack. */ \ + stackx = DOUBLE_STACK(stackx,stackb,len); \ + /* Rearrange the pointers. */ \ + stackp = stackx + (stackp - stackb); \ + stackb = stackx; \ + stacke = stackb + 2 * len; \ + } \ + \ + /* Now push the info for each of those registers. */ \ + for (this_reg = 1; this_reg <= last_used_reg; this_reg++) \ + { \ + *stackp++ = regstart[this_reg]; \ + *stackp++ = regend[this_reg]; \ + *stackp++ = (unsigned char *) &reg_info[this_reg]; \ + } \ + \ + /* Push how many registers we saved. */ \ + *stackp++ = (unsigned char *) last_used_reg; \ + \ + *stackp++ = pattern_place; \ + *stackp++ = string_place; \ + } + + +/* This pops what PUSH_FAILURE_POINT pushes. */ + +#define POP_FAILURE_POINT() \ + { \ + int temp; \ + stackp -= 2; /* Remove failure points. */ \ + temp = (int) *--stackp; /* How many regs pushed. */ \ + temp *= NUM_REG_ITEMS; /* How much to take off the stack. */ \ + stackp -= temp; /* Remove the register info. */ \ + } + + +#define MATCHING_IN_FIRST_STRING (dend == end_match_1) + +/* Is true if there is a first string and if PTR is pointing anywhere + inside it or just past the end. */ + +#define IS_IN_FIRST_STRING(ptr) \ + (size1 && string1 <= (ptr) && (ptr) <= string1 + size1) + +/* Call before fetching a character with *d. This switches over to + string2 if necessary. */ + +#define PREFETCH \ + while (d == dend) \ + { \ + /* end of string2 => fail. */ \ + if (dend == end_match_2) \ + goto fail; \ + /* end of string1 => advance to string2. 
*/ \ + d = string2; \ + dend = end_match_2; \ + } + + +/* Call this when have matched something; it sets `matched' flags for the + registers corresponding to the subexpressions of which we currently + are inside. */ +#define SET_REGS_MATCHED \ + { unsigned this_reg; \ + for (this_reg = 0; this_reg < RE_NREGS; this_reg++) \ + { \ + if (IS_ACTIVE(reg_info[this_reg])) \ + MATCHED_SOMETHING(reg_info[this_reg]) = 1; \ + else \ + MATCHED_SOMETHING(reg_info[this_reg]) = 0; \ + } \ + } + +/* Test if at very beginning or at very end of the virtual concatenation + of string1 and string2. If there is only one string, we've put it in + string2. */ + +#define AT_STRINGS_BEG (d == (size1 ? string1 : string2) || !size2) +#define AT_STRINGS_END (d == end2) + +#define AT_WORD_BOUNDARY \ + (AT_STRINGS_BEG || AT_STRINGS_END || IS_A_LETTER (d - 1) != IS_A_LETTER (d)) + +/* We have two special cases to check for: + 1) if we're past the end of string1, we have to look at the first + character in string2; + 2) if we're before the beginning of string2, we have to look at the + last character in string1; we assume there is a string1, so use + this in conjunction with AT_STRINGS_BEG. */ +#define IS_A_LETTER(d) \ + (SYNTAX ((d) == end1 ? *string2 : (d) == string2 - 1 ? *(end1 - 1) : *(d))\ + == Sword) + + +/* Match the pattern described by PBUFP against the virtual + concatenation of STRING1 and STRING2, which are of SIZE1 and SIZE2, + respectively. Start the match at index POS in the virtual + concatenation of STRING1 and STRING2. In REGS, return the indices of + the virtual concatenation of STRING1 and STRING2 that matched the + entire PBUFP->buffer and its contained subexpressions. Do not + consider matching one past the index MSTOP in the virtual + concatenation of STRING1 and STRING2. + + If pbufp->fastmap is nonzero, then it had better be up to date. 
+ + The reason that the data to match are specified as two components + which are to be regarded as concatenated is so this function can be + used directly on the contents of an Emacs buffer. + + -1 is returned if there is no match. -2 is returned if there is an + error (such as match stack overflow). Otherwise the value is the + length of the substring which was matched. */ + +int +re_match_2 (pbufp, string1_arg, size1, string2_arg, size2, pos, regs, mstop) + struct re_pattern_buffer *pbufp; + char *string1_arg, *string2_arg; + int size1, size2; + int pos; + struct re_registers *regs; + int mstop; +{ + register unsigned char *p = (unsigned char *) pbufp->buffer; + + /* Pointer to beyond end of buffer. */ + register unsigned char *pend = p + pbufp->used; + + unsigned char *string1 = (unsigned char *) string1_arg; + unsigned char *string2 = (unsigned char *) string2_arg; + unsigned char *end1; /* Just past end of first string. */ + unsigned char *end2; /* Just past end of second string. */ + + /* Pointers into string1 and string2, just past the last characters in + each to consider matching. */ + unsigned char *end_match_1, *end_match_2; + + register unsigned char *d, *dend; + register int mcnt; /* Multipurpose. */ + unsigned char *translate = (unsigned char *) pbufp->translate; + unsigned is_a_jump_n = 0; + + /* Failure point stack. Each place that can handle a failure further + down the line pushes a failure point on this stack. It consists of + restart, regend, and reg_info for all registers corresponding to the + subexpressions we're currently inside, plus the number of such + registers, and, finally, two char *'s. The first char * is where to + resume scanning the pattern; the second one is where to resume + scanning the strings. If the latter is zero, the failure point is a + ``dummy''; if a failure happens and the failure point is a dummy, it + gets discarded and the next next one is tried. 
*/ + +#ifndef NO_ALLOCA + unsigned char *initial_stack[MAX_NUM_FAILURE_ITEMS * NFAILURES]; +#endif + unsigned char **stackb; + unsigned char **stackp; + unsigned char **stacke; + + + /* Information on the contents of registers. These are pointers into + the input strings; they record just what was matched (on this + attempt) by a subexpression part of the pattern, that is, the + regnum-th regstart pointer points to where in the pattern we began + matching and the regnum-th regend points to right after where we + stopped matching the regnum-th subexpression. (The zeroth register + keeps track of what the whole pattern matches.) */ + + unsigned char *regstart[RE_NREGS]; + unsigned char *regend[RE_NREGS]; + + /* The is_active field of reg_info helps us keep track of which (possibly + nested) subexpressions we are currently in. The matched_something + field of reg_info[reg_num] helps us tell whether or not we have + matched any of the pattern so far this time through the reg_num-th + subexpression. These two fields get reset each time through any + loop their register is in. */ + + struct register_info reg_info[RE_NREGS]; + + + /* The following record the register info as found in the above + variables when we find a match better than any we've seen before. + This happens as we backtrack through the failure points, which in + turn happens only if we have not yet matched the entire string. */ + + unsigned best_regs_set = 0; + unsigned char *best_regstart[RE_NREGS]; + unsigned char *best_regend[RE_NREGS]; + + /* Initialize the stack. 
*/ +#ifdef NO_ALLOCA + stackb = (unsigned char **) malloc (MAX_NUM_FAILURE_ITEMS * NFAILURES * sizeof (char *)); +#else + stackb = initial_stack; +#endif + stackp = stackb; + stacke = &stackb[MAX_NUM_FAILURE_ITEMS * NFAILURES]; + +#ifdef DEBUG_REGEX + fprintf (stderr, "Entering re_match_2(%s%s)\n", string1_arg, string2_arg); +#endif + + /* Initialize subexpression text positions to -1 to mark ones that no + \( or ( and \) or ) has been seen for. Also set all registers to + inactive and mark them as not having matched anything or ever + failed. */ + for (mcnt = 0; mcnt < RE_NREGS; mcnt++) + { + regstart[mcnt] = regend[mcnt] = (unsigned char *) (-1L); + IS_ACTIVE (reg_info[mcnt]) = 0; + MATCHED_SOMETHING (reg_info[mcnt]) = 0; + } + + if (regs) + for (mcnt = 0; mcnt < RE_NREGS; mcnt++) + regs->start[mcnt] = regs->end[mcnt] = -1; + + /* Set up pointers to ends of strings. + Don't allow the second string to be empty unless both are empty. */ + if (size2 == 0) + { + string2 = string1; + size2 = size1; + string1 = 0; + size1 = 0; + } + end1 = string1 + size1; + end2 = string2 + size2; + + /* Compute where to stop matching, within the two strings. */ + if (mstop <= size1) + { + end_match_1 = string1 + mstop; + end_match_2 = string2; + } + else + { + end_match_1 = end1; + end_match_2 = string2 + mstop - size1; + } + + /* `p' scans through the pattern as `d' scans through the data. `dend' + is the end of the input string that `d' points within. `d' is + advanced into the following input string whenever necessary, but + this happens before fetching; therefore, at the beginning of the + loop, `d' can be pointing at the end of a string, but it cannot + equal string2. */ + + if (size1 != 0 && pos <= size1) + d = string1 + pos, dend = end_match_1; + else + d = string2 + pos - size1, dend = end_match_2; + + + /* This loops over pattern commands. 
It exits by returning from the + function if match is complete, or it drops through if match fails + at this starting point in the input data. */ + + while (1) + { +#ifdef DEBUG_REGEX + fprintf (stderr, + "regex loop(%d): matching 0x%02d\n", + p - (unsigned char *) pbufp->buffer, + *p); +#endif + is_a_jump_n = 0; + /* End of pattern means we might have succeeded. */ + if (p == pend) + { + /* If not end of string, try backtracking. Otherwise done. */ + if (d != end_match_2) + { + if (stackp != stackb) + { + /* More failure points to try. */ + + unsigned in_same_string = + IS_IN_FIRST_STRING (best_regend[0]) + == MATCHING_IN_FIRST_STRING; + + /* If exceeds best match so far, save it. */ + if (! best_regs_set + || (in_same_string && d > best_regend[0]) + || (! in_same_string && ! MATCHING_IN_FIRST_STRING)) + { + best_regs_set = 1; + best_regend[0] = d; /* Never use regstart[0]. */ + + for (mcnt = 1; mcnt < RE_NREGS; mcnt++) + { + best_regstart[mcnt] = regstart[mcnt]; + best_regend[mcnt] = regend[mcnt]; + } + } + goto fail; + } + /* If no failure points, don't restore garbage. */ + else if (best_regs_set) + { + restore_best_regs: + /* Restore best match. */ + d = best_regend[0]; + + for (mcnt = 0; mcnt < RE_NREGS; mcnt++) + { + regstart[mcnt] = best_regstart[mcnt]; + regend[mcnt] = best_regend[mcnt]; + } + } + } + + /* If caller wants register contents data back, convert it + to indices. 
*/ + if (regs) + { + regs->start[0] = pos; + if (MATCHING_IN_FIRST_STRING) + regs->end[0] = d - string1; + else + regs->end[0] = d - string2 + size1; + for (mcnt = 1; mcnt < RE_NREGS; mcnt++) + { + if (regend[mcnt] == (unsigned char *)(-1L)) + { + regs->start[mcnt] = -1; + regs->end[mcnt] = -1; + continue; + } + if (IS_IN_FIRST_STRING (regstart[mcnt])) + regs->start[mcnt] = regstart[mcnt] - string1; + else + regs->start[mcnt] = regstart[mcnt] - string2 + size1; + + if (IS_IN_FIRST_STRING (regend[mcnt])) + regs->end[mcnt] = regend[mcnt] - string1; + else + regs->end[mcnt] = regend[mcnt] - string2 + size1; + } + } + FREE_AND_RETURN(stackb, + (d - pos - (MATCHING_IN_FIRST_STRING ? + string1 : + string2 - size1))); + } + + /* Otherwise match next pattern command. */ +#ifdef SWITCH_ENUM_BUG + switch ((int) ((enum regexpcode) *p++)) +#else + switch ((enum regexpcode) *p++) +#endif + { + + /* \( [or `(', as appropriate] is represented by start_memory, + \) by stop_memory. Both of those commands are followed by + a register number in the next byte. The text matched + within the \( and \) is recorded under that number. */ + case start_memory: + regstart[*p] = d; + IS_ACTIVE (reg_info[*p]) = 1; + MATCHED_SOMETHING (reg_info[*p]) = 0; + p++; + break; + + case stop_memory: + regend[*p] = d; + IS_ACTIVE (reg_info[*p]) = 0; + + /* If just failed to match something this time around with a sub- + expression that's in a loop, try to force exit from the loop. */ + if ((! 
MATCHED_SOMETHING (reg_info[*p]) + || (enum regexpcode) p[-3] == start_memory) + && (p + 1) != pend) + { + register unsigned char *p2 = p + 1; + mcnt = 0; + switch (*p2++) + { + case jump_n: + is_a_jump_n = 1; + case finalize_jump: + case maybe_finalize_jump: + case jump: + case dummy_failure_jump: + EXTRACT_NUMBER_AND_INCR (mcnt, p2); + if (is_a_jump_n) + p2 += 2; + break; + } + p2 += mcnt; + + /* If the next operation is a jump backwards in the pattern + to an on_failure_jump, exit from the loop by forcing a + failure after pushing on the stack the on_failure_jump's + jump in the pattern, and d. */ + if (mcnt < 0 && (enum regexpcode) *p2++ == on_failure_jump) + { + EXTRACT_NUMBER_AND_INCR (mcnt, p2); + PUSH_FAILURE_POINT (p2 + mcnt, d); + goto fail; + } + } + p++; + break; + + /* \<digit> has been turned into a `duplicate' command which is + followed by the numeric value of <digit> as the register number. */ + case duplicate: + { + int regno = *p++; /* Get which register to match against */ + register unsigned char *d2, *dend2; + + /* Where in input to try to start matching. */ + d2 = regstart[regno]; + + /* Where to stop matching; if both the place to start and + the place to stop matching are in the same string, then + set to the place to stop, otherwise, for now have to use + the end of the first string. */ + + dend2 = ((IS_IN_FIRST_STRING (regstart[regno]) + == IS_IN_FIRST_STRING (regend[regno])) + ? regend[regno] : end_match_1); + while (1) + { + /* If necessary, advance to next segment in register + contents. */ + while (d2 == dend2) + { + if (dend2 == end_match_2) break; + if (dend2 == regend[regno]) break; + d2 = string2, dend2 = regend[regno]; /* end of string1 => advance to string2. */ + } + /* At end of register contents => success */ + if (d2 == dend2) break; + + /* If necessary, advance to next segment in data. */ + PREFETCH; + + /* How many characters left in this segment to match. 
*/ + mcnt = dend - d; + + /* Want how many consecutive characters we can match in + one shot, so, if necessary, adjust the count. */ + if (mcnt > dend2 - d2) + mcnt = dend2 - d2; + + /* Compare that many; failure if mismatch, else move + past them. */ + if (translate + ? memcmp_translate (d, d2, mcnt, translate) + : memcmp ((char *)d, (char *)d2, mcnt)) + goto fail; + d += mcnt, d2 += mcnt; + } + } + break; + + case anychar: + PREFETCH; /* Fetch a data character. */ + /* Match anything but a newline, maybe even a null. */ + if ((translate ? translate[*d] : *d) == '\n' + || ((obscure_syntax & RE_DOT_NOT_NULL) + && (translate ? translate[*d] : *d) == '\000')) + goto fail; + SET_REGS_MATCHED; + d++; + break; + + case charset: + case charset_not: + { + int not = 0; /* Nonzero for charset_not. */ + register int c; + if (*(p - 1) == (unsigned char) charset_not) + not = 1; + + PREFETCH; /* Fetch a data character. */ + + if (translate) + c = translate[*d]; + else + c = *d; + + if (c < *p * BYTEWIDTH + && p[1 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + not = !not; + + p += 1 + *p; + + if (!not) goto fail; + SET_REGS_MATCHED; + d++; + break; + } + + case begline: + if ((size1 != 0 && d == string1) + || (size1 == 0 && size2 != 0 && d == string2) + || (d && d[-1] == '\n') + || (size1 == 0 && size2 == 0)) + break; + else + goto fail; + + case endline: + if (d == end2 + || (d == end1 ? (size2 == 0 || *string2 == '\n') : *d == '\n')) + break; + goto fail; + + /* `or' constructs are handled by starting each alternative with + an on_failure_jump that points to the start of the next + alternative. Each alternative except the last ends with a + jump to the joining point. (Actually, each jump except for + the last one really jumps to the following jump, because + tensioning the jumps is a hassle.) */ + + /* The start of a stupid repeat has an on_failure_jump that points + past the end of the repeat text. 
This makes a failure point so + that on failure to match a repetition, matching restarts past + as many repetitions have been found with no way to fail and + look for another one. */ + + /* A smart repeat is similar but loops back to the on_failure_jump + so that each repetition makes another failure point. */ + + case on_failure_jump: + on_failure: + EXTRACT_NUMBER_AND_INCR (mcnt, p); + PUSH_FAILURE_POINT (p + mcnt, d); + break; + + /* The end of a smart repeat has a maybe_finalize_jump back. + Change it either to a finalize_jump or an ordinary jump. */ + case maybe_finalize_jump: + EXTRACT_NUMBER_AND_INCR (mcnt, p); + { + register unsigned char *p2 = p; + /* Compare what follows with the beginning of the repeat. + If we can establish that there is nothing that they would + both match, we can change to finalize_jump. */ + while (p2 + 1 != pend + && (*p2 == (unsigned char) stop_memory + || *p2 == (unsigned char) start_memory)) + p2 += 2; /* Skip over reg number. */ + if (p2 == pend) + p[-3] = (unsigned char) finalize_jump; + else if (*p2 == (unsigned char) exactn + || *p2 == (unsigned char) endline) + { + register int c = *p2 == (unsigned char) endline ? '\n' : p2[2]; + register unsigned char *p1 = p + mcnt; + /* p1[0] ... p1[2] are an on_failure_jump. + Examine what follows that. */ + if (p1[3] == (unsigned char) exactn && p1[5] != c) + p[-3] = (unsigned char) finalize_jump; + else if (p1[3] == (unsigned char) charset + || p1[3] == (unsigned char) charset_not) + { + int not = p1[3] == (unsigned char) charset_not; + if (c < p1[4] * BYTEWIDTH + && p1[5 + c / BYTEWIDTH] & (1 << (c % BYTEWIDTH))) + not = !not; + /* `not' is 1 if c would match. */ + /* That means it is not safe to finalize. */ + if (!not) + p[-3] = (unsigned char) finalize_jump; + } + } + } + p -= 2; /* Point at relative address again. */ + if (p[-1] != (unsigned char) finalize_jump) + { + p[-1] = (unsigned char) jump; + goto nofinalize; + } + /* Note fall through. 
*/ + + /* The end of a stupid repeat has a finalize_jump back to the + start, where another failure point will be made which will + point to after all the repetitions found so far. */ + + /* Take off failure points put on by matching on_failure_jump + because didn't fail. Also remove the register information + put on by the on_failure_jump. */ + case finalize_jump: + POP_FAILURE_POINT (); + /* Note fall through. */ + + /* Jump without taking off any failure points. */ + case jump: + nofinalize: + EXTRACT_NUMBER_AND_INCR (mcnt, p); + p += mcnt; + break; + + case dummy_failure_jump: + /* Normally, the on_failure_jump pushes a failure point, which + then gets popped at finalize_jump. We will end up at + finalize_jump, also, and with a pattern of, say, `a+', we + are skipping over the on_failure_jump, so we have to push + something meaningless for finalize_jump to pop. */ + PUSH_FAILURE_POINT (0, 0); + goto nofinalize; + + + /* Have to succeed matching what follows at least n times. Then + just handle like an on_failure_jump. */ + case succeed_n: + EXTRACT_NUMBER (mcnt, p + 2); + /* Originally, this is how many times we HAVE to succeed. */ + if (mcnt) + { + mcnt--; + p += 2; + STORE_NUMBER_AND_INCR (p, mcnt); + } + else if (mcnt == 0) + { + p[2] = unused; + p[3] = unused; + goto on_failure; + } + else + { + fprintf (stderr, "regex: the succeed_n's n is not set.\n"); + exit (1); + } + break; + + case jump_n: + EXTRACT_NUMBER (mcnt, p + 2); + /* Originally, this is how many times we CAN jump. */ + if (mcnt) + { + mcnt--; + STORE_NUMBER(p + 2, mcnt); + goto nofinalize; /* Do the jump without taking off + any failure points. */ + } + /* If don't have to jump any more, skip over the rest of command. */ + else + p += 4; + break; + + case set_number_at: + { + register unsigned char *p1; + + EXTRACT_NUMBER_AND_INCR (mcnt, p); + p1 = p + mcnt; + EXTRACT_NUMBER_AND_INCR (mcnt, p); + STORE_NUMBER (p1, mcnt); + break; + } + + /* Ignore these. 
Used to ignore the n of succeed_n's which + currently have n == 0. */ + case unused: + break; + + case wordbound: + if (AT_WORD_BOUNDARY) + break; + goto fail; + + case notwordbound: + if (AT_WORD_BOUNDARY) + goto fail; + break; + + case wordbeg: + if (IS_A_LETTER (d) && (!IS_A_LETTER (d - 1) || AT_STRINGS_BEG)) + break; + goto fail; + + case wordend: + /* Have to check if AT_STRINGS_BEG before looking at d - 1. */ + if (!AT_STRINGS_BEG && IS_A_LETTER (d - 1) + && (!IS_A_LETTER (d) || AT_STRINGS_END)) + break; + goto fail; + +#ifdef emacs + case before_dot: + if (PTR_CHAR_POS (d) >= point) + goto fail; + break; + + case at_dot: + if (PTR_CHAR_POS (d) != point) + goto fail; + break; + + case after_dot: + if (PTR_CHAR_POS (d) <= point) + goto fail; + break; + + case wordchar: + mcnt = (int) Sword; + goto matchsyntax; + + case syntaxspec: + mcnt = *p++; + matchsyntax: + PREFETCH; + if (SYNTAX (*d++) != (enum syntaxcode) mcnt) goto fail; + SET_REGS_MATCHED; + break; + + case notwordchar: + mcnt = (int) Sword; + goto matchnotsyntax; + + case notsyntaxspec: + mcnt = *p++; + matchnotsyntax: + PREFETCH; + if (SYNTAX (*d++) == (enum syntaxcode) mcnt) goto fail; + SET_REGS_MATCHED; + break; + +#else /* not emacs */ + + case wordchar: + PREFETCH; + if (!IS_A_LETTER (d)) + goto fail; + SET_REGS_MATCHED; + break; + + case notwordchar: + PREFETCH; + if (IS_A_LETTER (d)) + goto fail; + SET_REGS_MATCHED; + break; + + case before_dot: + case at_dot: + case after_dot: + case syntaxspec: + case notsyntaxspec: + break; + +#endif /* not emacs */ + + case begbuf: + if (AT_STRINGS_BEG) + break; + goto fail; + + case endbuf: + if (AT_STRINGS_END) + break; + goto fail; + + case exactn: + /* Match the next few pattern characters exactly. + mcnt is how many characters to match. */ + mcnt = *p++; + /* This is written out as an if-else so we don't waste time + testing `translate' inside the loop. 
*/ + if (translate) + { + do + { + PREFETCH; + if (translate[*d++] != *p++) goto fail; + } + while (--mcnt); + } + else + { + do + { + PREFETCH; + if (*d++ != *p++) goto fail; + } + while (--mcnt); + } + SET_REGS_MATCHED; + break; + } + continue; /* Successfully executed one pattern command; keep going. */ + + /* Jump here if any matching operation fails. */ + fail: + if (stackp != stackb) + /* A restart point is known. Restart there and pop it. */ + { + short last_used_reg, this_reg; + + /* If this failure point is from a dummy_failure_point, just + skip it. */ + if (!stackp[-2]) + { + POP_FAILURE_POINT (); + goto fail; + } + + d = *--stackp; + p = *--stackp; + if (d >= string1 && d <= end1) + dend = end_match_1; + /* Restore register info. */ + last_used_reg = (long) *--stackp; + + /* Make the ones that weren't saved -1 or 0 again. */ + for (this_reg = RE_NREGS - 1; this_reg > last_used_reg; this_reg--) + { + regend[this_reg] = (unsigned char *) (-1L); + regstart[this_reg] = (unsigned char *) (-1L); + IS_ACTIVE (reg_info[this_reg]) = 0; + MATCHED_SOMETHING (reg_info[this_reg]) = 0; + } + + /* And restore the rest from the stack. */ + for ( ; this_reg > 0; this_reg--) + { + reg_info[this_reg] = *(struct register_info *) *--stackp; + regend[this_reg] = *--stackp; + regstart[this_reg] = *--stackp; + } + } + else + break; /* Matching at this starting point really fails. */ + } + + if (best_regs_set) + goto restore_best_regs; + + FREE_AND_RETURN(stackb,(-1)); /* Failure to match. */ +} + + +static int +memcmp_translate (s1, s2, len, translate) + unsigned char *s1, *s2; + register int len; + unsigned char *translate; +{ + register unsigned char *p1 = s1, *p2 = s2; + while (len) + { + if (translate [*p1++] != translate [*p2++]) return 1; + len--; + } + return 0; +} + + + +/* Entry points compatible with 4.2 BSD regex library. 
*/ + +#if !defined(emacs) && !defined(GAWK) + +static struct re_pattern_buffer re_comp_buf; + +char * +re_comp (s) + char *s; +{ + if (!s) + { + if (!re_comp_buf.buffer) + return "No previous regular expression"; + return 0; + } + + if (!re_comp_buf.buffer) + { + if (!(re_comp_buf.buffer = (char *) malloc (200))) + return "Memory exhausted"; + re_comp_buf.allocated = 200; + if (!(re_comp_buf.fastmap = (char *) malloc (1 << BYTEWIDTH))) + return "Memory exhausted"; + } + return re_compile_pattern (s, strlen (s), &re_comp_buf); +} + +int +re_exec (s) + char *s; +{ + int len = strlen (s); + return 0 <= re_search (&re_comp_buf, s, len, 0, len, + (struct re_registers *) 0); +} +#endif /* not emacs && not GAWK */ + + + +#ifdef test + +#ifdef atarist +long _stksize = 2L; /* reserve memory for stack */ +#endif +#include <stdio.h> + +/* Indexed by a character, gives the upper case equivalent of the + character. */ + +char upcase[0400] = + { 000, 001, 002, 003, 004, 005, 006, 007, + 010, 011, 012, 013, 014, 015, 016, 017, + 020, 021, 022, 023, 024, 025, 026, 027, + 030, 031, 032, 033, 034, 035, 036, 037, + 040, 041, 042, 043, 044, 045, 046, 047, + 050, 051, 052, 053, 054, 055, 056, 057, + 060, 061, 062, 063, 064, 065, 066, 067, + 070, 071, 072, 073, 074, 075, 076, 077, + 0100, 0101, 0102, 0103, 0104, 0105, 0106, 0107, + 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117, + 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127, + 0130, 0131, 0132, 0133, 0134, 0135, 0136, 0137, + 0140, 0101, 0102, 0103, 0104, 0105, 0106, 0107, + 0110, 0111, 0112, 0113, 0114, 0115, 0116, 0117, + 0120, 0121, 0122, 0123, 0124, 0125, 0126, 0127, + 0130, 0131, 0132, 0173, 0174, 0175, 0176, 0177, + 0200, 0201, 0202, 0203, 0204, 0205, 0206, 0207, + 0210, 0211, 0212, 0213, 0214, 0215, 0216, 0217, + 0220, 0221, 0222, 0223, 0224, 0225, 0226, 0227, + 0230, 0231, 0232, 0233, 0234, 0235, 0236, 0237, + 0240, 0241, 0242, 0243, 0244, 0245, 0246, 0247, + 0250, 0251, 0252, 0253, 0254, 0255, 0256, 0257, + 0260, 0261, 0262, 
0263, 0264, 0265, 0266, 0267, + 0270, 0271, 0272, 0273, 0274, 0275, 0276, 0277, + 0300, 0301, 0302, 0303, 0304, 0305, 0306, 0307, + 0310, 0311, 0312, 0313, 0314, 0315, 0316, 0317, + 0320, 0321, 0322, 0323, 0324, 0325, 0326, 0327, + 0330, 0331, 0332, 0333, 0334, 0335, 0336, 0337, + 0340, 0341, 0342, 0343, 0344, 0345, 0346, 0347, + 0350, 0351, 0352, 0353, 0354, 0355, 0356, 0357, + 0360, 0361, 0362, 0363, 0364, 0365, 0366, 0367, + 0370, 0371, 0372, 0373, 0374, 0375, 0376, 0377 + }; + +#ifdef canned + +#include "tests.h" + +typedef enum { extended_test, basic_test } test_type; + +/* Use this to run the tests we've thought of. */ + +void +main () +{ + test_type t = extended_test; + + if (t == basic_test) + { + printf ("Running basic tests:\n\n"); + test_posix_basic (); + } + else if (t == extended_test) + { + printf ("Running extended tests:\n\n"); + test_posix_extended (); + } +} + +#else /* not canned */ + +/* Use this to run interactive tests. */ + +void +main (argc, argv) + int argc; + char **argv; +{ + char pat[80]; + struct re_pattern_buffer buf; + int i; + char c; + char fastmap[(1 << BYTEWIDTH)]; + + /* Allow a command argument to specify the style of syntax. 
*/ + if (argc > 1) + obscure_syntax = atol (argv[1]); + + buf.allocated = 40; + buf.buffer = (char *) malloc (buf.allocated); + buf.fastmap = fastmap; + buf.translate = upcase; + + while (1) + { + gets (pat); + + if (*pat) + { + re_compile_pattern (pat, strlen(pat), &buf); + + for (i = 0; i < buf.used; i++) + printchar (buf.buffer[i]); + + putchar ('\n'); + + printf ("%d allocated, %d used.\n", buf.allocated, buf.used); + + re_compile_fastmap (&buf); + printf ("Allowed by fastmap: "); + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (fastmap[i]) printchar (i); + putchar ('\n'); + } + + gets (pat); /* Now read the string to match against */ + + i = re_match (&buf, pat, strlen (pat), 0, 0); + printf ("Match value %d.\n", i); + } +} + +#endif + + +#ifdef NOTDEF +print_buf (bufp) + struct re_pattern_buffer *bufp; +{ + int i; + + printf ("buf is :\n----------------\n"); + for (i = 0; i < bufp->used; i++) + printchar (bufp->buffer[i]); + + printf ("\n%d allocated, %d used.\n", bufp->allocated, bufp->used); + + printf ("Allowed by fastmap: "); + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (bufp->fastmap[i]) + printchar (i); + printf ("\nAllowed by translate: "); + if (bufp->translate) + for (i = 0; i < (1 << BYTEWIDTH); i++) + if (bufp->translate[i]) + printchar (i); + printf ("\nfastmap is%s accurate\n", bufp->fastmap_accurate ? "" : "n't"); + printf ("can %s be null\n----------", bufp->can_be_null ? "" : "not"); +} +#endif /* NOTDEF */ + +printchar (c) + char c; +{ + if (c < 040 || c >= 0177) + { + putchar ('\\'); + putchar (((c >> 6) & 3) + '0'); + putchar (((c >> 3) & 7) + '0'); + putchar ((c & 7) + '0'); + } + else + putchar (c); +} + +error (string) + char *string; +{ + puts (string); + exit (1); +} +#endif /* test */ diff --git a/gnu/usr.bin/awk/regex.h b/gnu/usr.bin/awk/regex.h new file mode 100644 index 000000000000..fce11c3a97dd --- /dev/null +++ b/gnu/usr.bin/awk/regex.h @@ -0,0 +1,260 @@ +/* Definitions for data structures callers pass the regex library. 
+ + Copyright (C) 1985, 1989-90 Free Software Foundation, Inc. + + This program is free software; you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation; either version 1, or (at your option) + any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program; if not, write to the Free Software + Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA. */ + + +#ifndef __REGEXP_LIBRARY +#define __REGEXP_LIBRARY + +/* Define number of parens for which we record the beginnings and ends. + This affects how much space the `struct re_registers' type takes up. */ +#ifndef RE_NREGS +#define RE_NREGS 10 +#endif + +#define BYTEWIDTH 8 + + +/* Maximum number of duplicates an interval can allow. */ +#ifndef RE_DUP_MAX +#define RE_DUP_MAX ((1 << 15) - 1) +#endif + + +/* This defines the various regexp syntaxes. */ +extern long obscure_syntax; + + +/* The following bits are used in the obscure_syntax variable to choose among + alternative regexp syntaxes. */ + +/* If this bit is set, plain parentheses serve as grouping, and backslash + parentheses are needed for literal searching. + If not set, backslash-parentheses are grouping, and plain parentheses + are for literal searching. */ +#define RE_NO_BK_PARENS 1L + +/* If this bit is set, plain | serves as the `or'-operator, and \| is a + literal. + If not set, \| serves as the `or'-operator, and | is a literal. */ +#define RE_NO_BK_VBAR (1L << 1) + +/* If this bit is not set, plain + or ? serves as an operator, and \+, \? are + literals. + If set, \+, \? are operators and plain +, ? are literals. 
*/ +#define RE_BK_PLUS_QM (1L << 2) + +/* If this bit is set, | binds tighter than ^ or $. + If not set, the contrary. */ +#define RE_TIGHT_VBAR (1L << 3) + +/* If this bit is set, then treat newline as an OR operator. + If not set, treat it as a normal character. */ +#define RE_NEWLINE_OR (1L << 4) + +/* If this bit is set, then special characters may act as normal + characters in some contexts. Specifically, this applies to: + ^ -- only special at the beginning, or after ( or |; + $ -- only special at the end, or before ) or |; + *, +, ? -- only special when not after the beginning, (, or |. + If this bit is not set, special characters (such as *, ^, and $) + always have their special meaning regardless of the surrounding + context. */ +#define RE_CONTEXT_INDEP_OPS (1L << 5) + +/* If this bit is not set, then \ before anything inside [ and ] is taken as + a real \. + If set, then such a \ escapes the following character. This is a + special case for awk. */ +#define RE_AWK_CLASS_HACK (1L << 6) + +/* If this bit is set, then \{ and \} or { and } serve as interval operators. + If not set, then \{ and \} and { and } are treated as literals. */ +#define RE_INTERVALS (1L << 7) + +/* If this bit is not set, then \{ and \} serve as interval operators and + { and } are literals. + If set, then { and } serve as interval operators and \{ and \} are + literals. */ +#define RE_NO_BK_CURLY_BRACES (1L << 8) + +/* If this bit is set, then character classes are supported; they are: + [:alpha:], [:upper:], [:lower:], [:digit:], [:alnum:], [:xdigit:], + [:space:], [:print:], [:punct:], [:graph:], and [:cntrl:]. + If not set, then character classes are not supported. */ +#define RE_CHAR_CLASSES (1L << 9) + +/* If this bit is set, then the dot re doesn't match a null byte. + If not set, it does. */ +#define RE_DOT_NOT_NULL (1L << 10) + +/* If this bit is set, then [^...] doesn't match a newline. + If not set, it does. 
*/ +#define RE_HAT_NOT_NEWLINE (1L << 11) + +/* If this bit is set, back references are not recognized. + If not set, they are. */ +#define RE_NO_BK_REFS (1L << 12) + +/* If this bit is set, back references must refer to a preceding + subexpression. If not set, a back reference to a nonexistent + subexpression is treated as literal characters. */ +#define RE_NO_EMPTY_BK_REF (1L << 13) + +/* If this bit is set, bracket expressions can't be empty. + If it is not set, they can be empty. */ +#define RE_NO_EMPTY_BRACKETS (1L << 14) + +/* If this bit is set, then *, +, ? and { cannot be first in an re or + immediately after a |, or a (. Furthermore, a | cannot be first or + last in an re, or immediately follow another | or a (. Also, a ^ + cannot appear in a nonleading position and a $ cannot appear in a + nontrailing position (outside of bracket expressions, that is). */ +#define RE_CONTEXTUAL_INVALID_OPS (1L << 15) + +/* If this bit is set, then +, ? and | aren't recognized as operators. + If it's not, they are. */ +#define RE_LIMITED_OPS (1L << 16) + +/* If this bit is set, then an ending range point has to collate higher + than or equal to the starting range point. + If it's not set, then when the ending range point collates lower + than the starting range point, the range is just considered empty. */ +#define RE_NO_EMPTY_RANGES (1L << 17) + +/* If this bit is set, then a hyphen (-) can't be an ending range point. + If it isn't, then it can. */ +#define RE_NO_HYPHEN_RANGE_END (1L << 18) + + +/* Define combinations of bits for the standard possibilities. 
*/ +#define RE_SYNTAX_POSIX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR \ + | RE_CONTEXT_INDEP_OPS) +#define RE_SYNTAX_AWK (RE_NO_BK_PARENS | RE_NO_BK_VBAR | RE_AWK_CLASS_HACK) +#define RE_SYNTAX_EGREP (RE_NO_BK_PARENS | RE_NO_BK_VBAR \ + | RE_CONTEXT_INDEP_OPS | RE_NEWLINE_OR) +#define RE_SYNTAX_GREP (RE_BK_PLUS_QM | RE_NEWLINE_OR) +#define RE_SYNTAX_EMACS 0 +#define RE_SYNTAX_POSIX_BASIC (RE_INTERVALS | RE_BK_PLUS_QM \ + | RE_CHAR_CLASSES | RE_DOT_NOT_NULL \ + | RE_HAT_NOT_NEWLINE | RE_NO_EMPTY_BK_REF \ + | RE_NO_EMPTY_BRACKETS | RE_LIMITED_OPS \ + | RE_NO_EMPTY_RANGES | RE_NO_HYPHEN_RANGE_END) + +#define RE_SYNTAX_POSIX_EXTENDED (RE_INTERVALS | RE_NO_BK_CURLY_BRACES \ + | RE_NO_BK_VBAR | RE_NO_BK_PARENS \ + | RE_HAT_NOT_NEWLINE | RE_CHAR_CLASSES \ + | RE_NO_EMPTY_BRACKETS | RE_CONTEXTUAL_INVALID_OPS \ + | RE_NO_BK_REFS | RE_NO_EMPTY_RANGES \ + | RE_NO_HYPHEN_RANGE_END) + + +/* This data structure is used to represent a compiled pattern. */ + +struct re_pattern_buffer + { + char *buffer; /* Space holding the compiled pattern commands. */ + long allocated; /* Size of space that `buffer' points to. */ + long used; /* Length of portion of buffer actually occupied */ + char *fastmap; /* Pointer to fastmap, if any, or zero if none. */ + /* re_search uses the fastmap, if there is one, + to skip over totally implausible characters. */ + char *translate; /* Translate table to apply to all characters before + comparing, or zero for no translation. + The translation is applied to a pattern when it is + compiled and to data when it is matched. */ + char fastmap_accurate; + /* Set to zero when a new pattern is stored, + set to one when the fastmap is updated from it. */ + char can_be_null; /* Set to one by compiling fastmap + if this pattern might match the null string. + It does not necessarily match the null string + in that case, but if this is zero, it cannot. + 2 as value means can match null string + but at end of range or before a character + listed in the fastmap. 
*/ + }; + + +/* search.c (search_buffer) needs this one value. It is defined both in + regex.c and here. */ +#define RE_EXACTN_VALUE 1 + + +/* Structure to store register contents data in. + + Pass the address of such a structure as an argument to re_match, etc., + if you want this information back. + + For i from 1 to RE_NREGS - 1, start[i] records the starting index in + the string of where the ith subexpression matched, and end[i] records + one after the ending index. start[0] and end[0] are analogous, for + the entire pattern. */ + +struct re_registers + { + int start[RE_NREGS]; + int end[RE_NREGS]; + }; + + + +#ifdef __STDC__ + +extern char *re_compile_pattern (char *, size_t, struct re_pattern_buffer *); +/* Is this really advertised? */ +extern void re_compile_fastmap (struct re_pattern_buffer *); +extern int re_search (struct re_pattern_buffer *, char*, int, int, int, + struct re_registers *); +extern int re_search_2 (struct re_pattern_buffer *, char *, int, + char *, int, int, int, + struct re_registers *, int); +extern int re_match (struct re_pattern_buffer *, char *, int, int, + struct re_registers *); +extern int re_match_2 (struct re_pattern_buffer *, char *, int, + char *, int, int, struct re_registers *, int); +extern long re_set_syntax (long syntax); + +#ifndef GAWK +/* 4.2 bsd compatibility. */ +extern char *re_comp (char *); +extern int re_exec (char *); +#endif + +#else /* !__STDC__ */ + +extern char *re_compile_pattern (); +/* Is this really advertised? */ +extern void re_compile_fastmap (); +extern int re_search (), re_search_2 (); +extern int re_match (), re_match_2 (); +extern long re_set_syntax(); + +#ifndef GAWK +/* 4.2 bsd compatibility. 
*/ +extern char *re_comp (); +extern int re_exec (); +#endif + +#endif /* __STDC__ */ + + +#ifdef SYNTAX_TABLE +extern char *re_syntax_table; +#endif + +#endif /* !__REGEXP_LIBRARY */ diff --git a/gnu/usr.bin/awk/version.c b/gnu/usr.bin/awk/version.c new file mode 100644 index 000000000000..adea5fafacfb --- /dev/null +++ b/gnu/usr.bin/awk/version.c @@ -0,0 +1,46 @@ +char *version_string = "@(#)Gnu Awk (gawk) 2.15"; + +/* 1.02 fixed /= += *= etc to return the new Left Hand Side instead + of the Right Hand Side */ + +/* 1.03 Fixed split() to treat strings of space and tab as FS if + the split char is ' '. + + Added -v option to print version number + + Fixed bug that caused rounding when printing large numbers */ + +/* 2.00beta Incorporated the functionality of the "new" awk as described in + the book (reference not handy). Extensively tested, but no + doubt still buggy. Badly needs tuning and cleanup, in + particular in memory management which is currently almost + non-existent. */ + +/* 2.01 JF: Modified to compile under GCC, and fixed a few + bugs while I was at it. I hope I didn't add any more. + I modified parse.y to reduce the number of reduce/reduce + conflicts. There are still a few left. */ + +/* 2.02 Fixed JF's bugs; improved memory management, still needs + lots of work. */ + +/* 2.10 Major grammar rework and lots of bug fixes from David. + Major changes for performance enhancements from David. + A number of minor bug fixes and new features from Arnold. + Changes for MSDOS from Conrad Kwok and Scott Garfinkle. + The gawk.texinfo and info files included! */ + +/* 2.11 Bug fix release to 2.10. Lots of changes for portability, + speed, and configurability. */ + +/* 2.12 Lots of changes for portability, speed, and configurability. + Several bugs fixed. POSIX compliance. Removal of last set + of hard-wired limits. Atari and VMS ports added. */ + +/* 2.13 Public release of 2.12 */ + +/* 2.14 Mostly bug fixes. 
*/ + +/* 2.15 Bug fixes plus intermixing of command-line source and files, + GNU long options, ARGIND, ERRNO and Plan 9 style /dev/ files. */ + |