Today I would like to tell you a funny story about a problem that I faced at work.
We are a security company and we have many systems that process logs for various purposes. Many of these are written in Python and basically split the log lines to work on specific fields.
In Python, the split function splits on a given string if one is specified, otherwise on whitespace (and then you can access the resulting list by index as usual):
s = "Hello world! How are you?"
s.split()
Out: ['Hello', 'world!', 'How', 'are', 'you?']
s.split("!")
Out: ['Hello world', ' How are you?']
s.split("!")[0]
Out: 'Hello world'
Based on this, we needed a way to choose, through a configuration file, which character or string to split the lines on. The first solution we came up with was the following:
def get_sep():
    sep = read_conf["separator"]
    # Fall back to a single space when no separator is configured
    if not sep:
        return " "
    return sep

# Main function
s.split(get_sep())
Basically if we don't specify a separator, it will revert to the standard space. Great! Everything works, we are all happy, but... Take a look at the following string:
s = "Useless useless2  useless3 interesting_field useless4"
We want interesting_field, which is at index 3, so no problem, let's do it!
s.split()[3]
Out: 'interesting_field'
Great! But we have a custom separator now, let's use it:
s.split(" ")[3]
Out: 'useless3'
Oh dear, why is this happening? Well, it took me all morning to figure it out! The Python documentation is clear enough:
If sep is not specified or is None, a different splitting algorithm is applied: runs of consecutive whitespace are regarded as a single separator, and the result will contain no empty strings at the start or end if the string has leading or trailing whitespace. Consequently, splitting an empty string or a string consisting of just whitespace with a None separator returns [].
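To make the quoted behaviour concrete, here is a minimal sketch of the two splitting modes side by side. The sample line is made up for illustration, and the double space is built explicitly so it cannot get lost in rendering.

```python
# Hypothetical sample line with exactly two spaces between "b" and "c".
line = "a b" + " " * 2 + "c d"

# Explicit separator: every single space is a boundary,
# so the double space produces an empty string in the result.
explicit = line.split(" ")

# No separator (equivalent to sep=None): a run of whitespace
# counts as one boundary and no empty strings appear.
default = line.split()

print(explicit)  # ['a', 'b', '', 'c', 'd']
print(default)   # ['a', 'b', 'c', 'd']

# A whitespace-only string shows the difference even more clearly.
print("   ".split())     # []
print("   ".split(" "))  # ['', '', '', '']
```

Every index after the empty string is shifted by one in the explicit case, which is exactly the symptom seen above.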
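Indeed, if you look closely, our example string has a double space between two of the fields before interesting_field. Splitting on " " treats each space as its own separator, so the double space produces an empty string in the result and every later index shifts by one.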
The right (and cleaner) solution is to just return None when no separator is configured:
def get_sep():
    sep = read_conf["separator"]
    # None makes split() use its default whitespace algorithm
    if not sep:
        return None
    return sep

# Main function
s.split(get_sep())
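For anyone who wants to try the fix in isolation, here is a self-contained sketch of the same idea. It assumes the configuration is a plain dict passed in as an argument (the conf parameter and the sample lines are hypothetical, not our real setup).

```python
def get_sep(conf):
    """Return the configured separator, or None for default whitespace splitting."""
    # `conf` is a hypothetical dict standing in for the parsed configuration file.
    sep = conf.get("separator")
    # A missing or empty separator maps to None, not to " ".
    return sep if sep else None


# Hypothetical log line with a double space between f1 and f2.
line = "f0 f1" + " " * 2 + "f2 f3 f4"

# No separator configured: runs of whitespace collapse, index 3 is f3.
default_fields = line.split(get_sep({}))

# A custom separator still works as before.
comma_fields = "f0,f1,f2".split(get_sep({"separator": ","}))

print(default_fields[3])  # f3
print(comma_fields)       # ['f0', 'f1', 'f2']
```

With this version, a literal " " can still be set in the configuration if exact single-space splitting is really what you want.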