PythonSed

A full and working Python implementation of sed

General Information

Description

pythonsed is a full and working Python implementation of sed. Its reference is GNU sed 4.8 of which it implements nearly all commands and features. It may be used as a command line utility or it can be used as a module to bring sed functionality to Python scripts.

A complete set of tests is available as well as a testing utility. These tests include scripts from various origins and cover all aspects of sed functionalities.

Compatibility

Version	Status
Python 3.10	Fully compatible
Python 3.9	Fully compatible
Python 3.8	Fully compatible
Python 3.7	Fully compatible even for s///g with zero length matches. See this question at stackoverflow
Python 3	Fully compatible
Python 2.7.18 and above	Fully compatible
Python 2.7.17 and below	Not tested

Compatibility status applies also to the testing utility test-suite.py.
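
The zero-length-match remark in the table refers to a change in Python's re module: since Python 3.7, re.sub also replaces an empty match that is adjacent to a previous non-empty match, whereas sed's s///g does not. The snippet below is only an illustration of that difference and of one possible way to bridge it; the helper name sed_like_global_sub is hypothetical and this is not necessarily how PythonSed handles the case. The replacement is treated as a literal string (no back references).

```python
import re

# Python 3.7+ replaces the empty match right after "x" as well.
print(re.sub("x*", "-", "abxd"))


def sed_like_global_sub(pattern, repl, text):
    """Global substitution that ignores an empty match glued to the previous match.

    Hypothetical helper for illustration; repl is inserted literally.
    """
    regex = re.compile(pattern)
    pieces = []
    pos = 0        # where the next search starts
    copied = 0     # text[:copied] has already been emitted
    prev_end = -1  # end of the previously accepted match
    while pos <= len(text):
        m = regex.search(text, pos)
        if m is None:
            break
        if m.start() == m.end() == prev_end:
            # Empty match directly after the previous match: skip it.
            pos += 1
            continue
        pieces.append(text[copied:m.start()])
        pieces.append(repl)
        copied = prev_end = m.end()
        # After an empty match, move on by one character to avoid looping.
        pos = m.end() if m.end() > m.start() else m.end() + 1
    pieces.append(text[copied:])
    return "".join(pieces)


print(sed_like_global_sub("x*", "-", "abxd"))
```

On Python 3.7 or later the first call gives -a-b--d- while the helper gives -a-b-d-, which is what s/x*/-/g produces in GNU sed.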

License

PythonSed is released under the MIT license.

Install

To install PythonSed, just use pip:

python -m pip install pythonsed

This installs a command line utility named pythonsed and a package named PythonSed.

Another option is to use pipx. This automatically downloads pythonsed into a temporary virtual environment and executes it:

pipx run --spec git+https://github.com/fschaeck/PythonSed.git pythonsed

Usage as a command line utility

pythonsed is a console program that receives its arguments from the command line. The format of the command line is identical to that of GNU sed version 4.8, with the addition of the option -p or --python-syntax, which allows Python regular expressions to be used instead of the less powerful sed variant:

usage: sed.py [-h] [-H] [-v] [-f file] [-e string] [-i [backup suffix]] [-n]
              [-s] [-p] [-r] [-l LINE_LENGTH] [-d]
              [targets [targets ...]]

sed.py - python sed module and command line utility

positional arguments:
  targets               files to be processed (defaults to stdin if not
                        specified)

optional arguments:
  -h, --help            show this help message and exit
  -H, --htmlhelp        open html help page in web browser
  -v                    display version
  -f file, --file file  add script commands from file
  -e string, --expression string
                        add script commands from string
  -i [backup suffix], --in-place [backup suffix]
                        change input files in place
  -n, --quiet, --silent
                        print only if requested
  -s, --separate        consider input files as separate files instead of a
                        continuous stream
  -p, --python-syntax   Python regexp syntax
  -r, -E, --regexp-extended
                        extended regexp syntax
  -l LINE_LENGTH, --line-length LINE_LENGTH
                        line length to be used by l command
  -d, --debug           dump script and annotate execution on stderr

Options -e and -f can be repeated multiple times and add to the commands
executed for each line of input in the sequence they are specified.

If neither -e nor -f is given, the first positional parameter is taken
as the script, as if it had been prefixed with -e.

pythonsed may also use redirection to receive its input or send its output with the usual syntax:

cat myfile | pythonsed -f myscript1.sed | pythonsed -f myscript2.sed > myresultfile

It is also possible for pythonsed to receive its input from the keyboard by omitting any input file:

pythonsed -f myscript.sed

Usage as a Python module

An example covering all necessary symbols:

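from PythonSed import Sed, SedException

sed = Sed()
try:
    sed.no_autoprint = True       # same effect as the -n switch
    sed.regexp_extended = False   # basic regular expressions (no -r)
    sed.load_script('myscript.sed')
    sed.apply('myinput.txt')      # prints the result to stdout
except SedException as e:
    # errors raised by PythonSed carry a message attribute
    print(e.message)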

The input parameter of sed.apply() may be a string (interpreted as a filename) or a file-like object (including streams). sed.apply() returns the list of lines printed by the script. By default, these lines are also printed to stdout. The output parameter can suppress printing (output=None) or redirect the output to a text file (output='somefile.txt') or to a file-like object (including streams). If the input or output is a file-like object, it must be closed by the caller.

The script is given to sed by one or more calls to sed.load_string(my_script_string) and/or sed.load_script(my_script_file), where my_script_file can be either a file name or a file-like object (including streams). The final script is the concatenation of what was given in the various calls to load_string and load_script, in the sequence of the calls.
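
As a small sketch of the calls described above, the following wraps load_string and apply(output=None) so that a script held in a string is applied to text held in a string. The function name run_script is hypothetical; it assumes that an io.StringIO object is accepted as the file-like input mentioned above. The import is guarded so the snippet stays inert when PythonSed is not installed.

```python
import io

try:
    from PythonSed import Sed, SedException
except ImportError:
    # PythonSed is not installed; the demo at the bottom is skipped.
    Sed = None


def run_script(script_text, input_text):
    """Apply a sed script given as a string to a string (hypothetical helper)."""
    sed = Sed()
    sed.load_string(script_text)
    # output=None suppresses printing; the printed lines are returned as a list.
    return sed.apply(io.StringIO(input_text), output=None)


if Sed is not None:
    try:
        print(run_script("s/an/AN/g", "In Xanadu did Kubhla Khan\n"))
    except SedException as e:
        print(e.message)
```

Several load_string calls could be made before apply to build the script piece by piece, as described above.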

The available attributes and their defaults are:

Attribute	Default
encoding	'latin-1'
line_length	70
no_autoprint	False
regexp_extended	False
sed_compatible	True
in_place	None
separate	False
debug	0..3

They can all be specified as named parameters on the constructor of the Sed object or set individually before the call to apply(), which can be called multiple times on the same Sed object with changing attributes in between. The script can only be appended to after the creation of the Sed object and subsequent calls to apply() will re-compile the extended script before applying it to the input.

sed dialect

PythonSed implements all standard commands and regular expression features of sed. Its reference is GNU sed 4.8. It implements all of its features except for POSIX character classes such as [[:alpha:]] (Python's re module does not support them) and the GNU sed specific command e.

The GNU sed manual page can serve as a reference for PythonSed.

Addresses

`number`	standard behavior
`$`	standard behavior
`/regexp/`	standard behavior
`/regexp/I`	standard behavior
`\%regexp%`	standard behavior
`address,address`	standard behavior
`address!`	standard behavior
`0,/regexp/`	standard behavior
`first~step`	standard behavior
`addr1,+N`	standard behavior
`addr1,~N`	standard behavior
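
For readers unfamiliar with the GNU extension first~step, its meaning as described in the GNU sed manual can be sketched in a few lines. The function name matches_first_step is hypothetical and is not part of PythonSed.

```python
def matches_first_step(line_number, first, step):
    """Return True if line_number is selected by the address first~step.

    Line numbers start at 1. With step 0 only line `first` matches.
    """
    if step == 0:
        return line_number == first
    return line_number >= first and (line_number - first) % step == 0


# 1~2 selects the odd lines, 0~4 selects every fourth line.
print([n for n in range(1, 11) if matches_first_step(n, 1, 2)])
print([n for n in range(1, 11) if matches_first_step(n, 0, 4)])
```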

Regular expressions

`char`	standard behavior
`*`	standard behavior
`\+`	standard behavior
`\?`	standard behavior
`\{i\} \{i,j\} \{i,\}`	standard behavior
`$regexp$`	standard behavior
`.`	standard behavior
`^`	standard behavior. When not at start of regexp, matches as itself
`$`	standard behavior. When not at end of regexp, matches as itself
`[list] [^list]`	standard behavior
`regexp1\\|regexp2`	standard behavior
`regexp1regexp2`	standard behavior
`\digit`	standard behavior (back reference)
`textual escapes like \n and \t`	standard behavior
`functional escapes like \s, \S, \< and \w`	standard behavior
`\char`	standard behavior (disable special regexp characters)

Note that consecutive quantifiers (any combination of *, +, ?, {}) or a quantifier at the start of a regexp produce an error, in both basic and extended regular expression modes.

Extended regular expressions

The -r switch simplifies regular expressions by removing the need for a backslash before the special characters +, ?, (, ), |, { and }. If these characters must appear as regular characters in a regexp, they must be backslashed.
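
The relation between the two syntaxes can be illustrated with a toy converter that flips the backslash in front of exactly those characters. This is a simplified sketch: the name bre_to_ere is hypothetical, bracket expressions such as [+?] are not treated specially, and it is not the conversion PythonSed performs internally.

```python
import re

SPECIALS = set("+?(){}|")


def bre_to_ere(pattern):
    """Rewrite a basic regexp into the equivalent extended (-r) form."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            nxt = pattern[i + 1]
            # \+ in basic syntax is a plain + in extended syntax;
            # any other escape is copied unchanged.
            out.append(nxt if nxt in SPECIALS else ch + nxt)
            i += 2
        else:
            # A bare + is a literal in basic syntax, so it needs a backslash.
            out.append("\\" + ch if ch in SPECIALS else ch)
            i += 1
    return "".join(out)


extended = bre_to_ere(r"a\+b\(c\|d\)\{2\}")
print(extended)
print(bool(re.fullmatch(extended, "aabcd")))
```

For this particular pattern the extended form is also a valid Python regular expression, which is why it can be passed to re.fullmatch directly.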

Commands

`a\ text`	Compliant	(including one liner syntax and double address extensions)
`b label`	Compliant
`: label`	Compliant
`c\ text`	Compliant	(including single line and double address extensions)
`d`	Compliant
`D`	Compliant
`=`	Compliant	(including double address extension)
`g`	Compliant
`G`	Compliant
`h`	Compliant
`H`	Compliant
`i\ text`	Compliant	(including single line and double address extensions)
`l`	Compliant	(length parameter not implemented)
`n`	Compliant
`N`	Compliant
`p`	Compliant
`P`	Compliant
`q`	Compliant	(except exit code extension)
`r filename`	Compliant	(including double address extension but not reading from stdin)
`s`	Compliant	(including escape sequences in replacement (\L, \l, \U, \u, \E), modifiers M/m and combination of modifier g and number, but excluding modifier e)
`t label`	Compliant
`w filename`	Compliant	(including double address extension but not writing to stdout or stderr)
`x`	Compliant
`y`	Compliant
`#`	Compliant	(comments start anywhere in the line.)

Compliant means compliant with GNU sed description.

Testing

Description

The behavior of PythonSed is tested and compared to that of GNU sed with a set of tests and a testing utility.

The tests are either coded in text files with .suite extension or may be stored in test directories as standard sed scripts.

The test suites are:

`unit.suite`	a text file containing unitary tests
`chang.suite`	a text file containing scripts from Roger Chang web site
`test-suite1`	a set of scripts from GNU sed test suite
`test-suite2`	a set of scripts from the seder's grab-bag, Rosetta code web site and GitHub (lisp!)
`test-suite3`	additional unitary tests better stored in a folder with some extra data text files
`test-suite4`	a set of scripts from the sed $HOME

Note that the goal of these tests is not to check the correctness of the scripts but to verify that `sed.py` and GNU sed have the same behavior.

Testing utility

Tests are launched and checked with the test-suite.py Python script. This script runs the sed scripts with either the PythonSed package or any sed executable, which makes it possible to compare the behavior of PythonSed with that of GNU sed.

The calling syntax is:

test-suite.py <testsuite> [number] [-b executable] [-x list of script references]

Parameters
`testsuite`	either a text file with .suite extension or a test directory
`number`	an optional reference number of a test; when present, only this test is run
`executable`	an optional name or path of a sed executable to use for testing
`list of script references`	an optional list of tests to exclude, for instance when a feature is not implemented. A script reference is either the title of the test for tests stored in modules, or the name of the script file.

Text file test suites

When tests are stored in a text file (with .suite extension), they are made of four elements:

the title of the test
the script itself
the input list of lines
the expected result

The four elements of a test are separated with lines made of three identical characters, for instance:

---
Test substitution with global flag
---
s/an/AN/g
---
In Xanadu did Kubhla Khan
---
In XANadu did Kubhla KhAN
---

Note also that:

The script section may be empty, which makes it possible to test a script on various data without repeating the script.
The input and output sections may be empty, which makes it possible to test various scripts on the same data without repeating the data.
Flags are set with a comment on the first line. As usual, #n stops autoprint mode and extended regexp mode is set with #r or #nr.
The expected result may be ??? when the test has no result and ends with an error.
All text outside the test, i.e. before the first delimiter or after the last delimiter, is ignored and acts like a comment.
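
The layout described above can be read with a few lines of Python. The sketch below is not the parser used by test-suite.py; it assumes that every test carries its own five delimiter lines (as in the example), that delimiters are three identical non-alphanumeric characters, and it does not implement the reuse of an empty script or data section from a previous test. The names is_delimiter and parse_suite are hypothetical.

```python
def is_delimiter(line):
    """A line of exactly three identical characters (assumed non-alphanumeric)."""
    return (len(line) == 3 and len(set(line)) == 1
            and not line[0].isalnum() and not line[0].isspace())


def parse_suite(text):
    """Return a list of tests, each with title, script, input and expected lines."""
    tests = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        if not is_delimiter(lines[i]):
            i += 1  # text outside a test acts like a comment
            continue
        delim = lines[i]
        sections, current = [], []
        i += 1
        # Collect the four sections, each closed by the same delimiter line.
        while i < len(lines) and len(sections) < 4:
            if lines[i] == delim:
                sections.append(current)
                current = []
            else:
                current.append(lines[i])
            i += 1
        if len(sections) == 4:
            title, script, data, expected = sections
            tests.append({"title": "\n".join(title), "script": script,
                          "input": data, "expected": expected})
    return tests


sample = """a comment before the test
---
Test substitution with global flag
---
s/an/AN/g
---
In Xanadu did Kubhla Khan
---
In XANadu did Kubhla KhAN
---
"""
print(parse_suite(sample))
```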

Directory test suites

When tests are stored in a directory, they are represented by three or four files with same name but different extensions:

the script itself, with '.sed' extension
the input of the script, with '.inp' extension
the expected result of the script, with '.good' extension
possibly a file, with '.flags' extension, containing the sed switches -n and/or -r.

Some other files may be used when using reading or writing commands in scripts. In that case, the expected written files must be named with extension '.wgoodN' where N is the number of the expected written file.
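
The file naming convention above lends itself to a simple discovery routine. The following sketch groups the files of a test directory by script name; collect_directory_tests is a hypothetical helper, not a function of test-suite.py, and it ignores the '.wgoodN' files.

```python
from pathlib import Path


def collect_directory_tests(directory):
    """Return one entry per '.sed' script found in the directory."""
    tests = []
    for script in sorted(Path(directory).glob("*.sed")):
        flags_file = script.with_suffix(".flags")
        tests.append({
            "name": script.stem,
            "script": script,
            "input": script.with_suffix(".inp"),
            "expected": script.with_suffix(".good"),
            # Optional switches (-n and/or -r), one list entry per switch.
            "flags": flags_file.read_text().split() if flags_file.exists() else [],
        })
    return tests
```

Each entry could then be handed to whatever runner executes the script and compares its output with the '.good' file.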

Timing

A Python implementation of sed has to face legitimate questions about timing. Fortunately, the results are not bad. Unfortunately, run times seem to increase with the Python version number. Timings are given in seconds.

Platform	GNU sed 4.2.1	sed.py Python 2.6	sed.py Python 2.7	sed.py Python 3.4
Windows 7, Intel Xeon 3.2 GHz, 6 GB RAM	19.4	19.1	22.6	26.9
Windows XP, Intel Pentium 4 3.2 GHz, 4 GB RAM	47.5	50.7	56.5	71.2
Linux, Intel Pentium 4 3.2 GHz, 4 GB RAM	-	-	51.0	-

Test conditions:

Only script files are used (scripts from the folders test-suiteN). This avoids measuring the time needed to extract scripts, inputs and results from .suite files.
The given values are averaged from three consecutive test runs.

To do list

At some point, one has to decide what will be in the upcoming release and what can be delayed. Here are some features which would be nice to have but can be delayed to a future version.

Better error handling when testing (the error message could be tested)
Use PythonSed as a basis for a sed debugger.
...
