DOC: Add regex example in str.split docstring #26267

vandenn · 2019-05-02T15:26:25Z

closes Document Using Regex for str.split #25296
tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

This adds a regex example to the str.split() documentation so that readers are aware of the need to escape special characters when using a regular expression as the pattern.
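
As a rough illustration of why the escaping matters, here is a minimal sketch using the standard library `re` module as a stand-in. It assumes that a regex pattern passed to `Series.str.split` is handled by `re`, which this thread does not state, and the variable names are made up for the example.

```python
import re

text = "1+1=2"

# Escaped "+" is a literal plus sign; "|" remains the alternation operator.
parts = re.split(r"\+|=", text)
print(parts)  # ['1', '1', '2']

# Unescaped, "+" is a quantifier with nothing to repeat, so compiling fails.
try:
    re.compile("+|=")
    unescaped_ok = True
except re.error:
    unescaped_ok = False
print(unescaped_ok)  # False

# re.escape builds a pattern that matches the delimiter literally.
literal_plus = re.escape("+")
print(re.split(literal_plus, text))  # ['1', '1=2']
```

`re.escape` is an alternative when the delimiter comes from user input and should never be read as a regex.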

The docstring validation errors shown below already exist at master's HEAD without any modifications. The new docstring content introduces no additional errors.
Output of docstring validation:

$ python scripts/validate_docstrings.py pandas.Series.str.split

################################################################################
##################### Docstring (pandas.Series.str.split)  #####################
################################################################################

Split strings around given separator/delimiter.

Splits the string in the Series/Index from the beginning,
at the specified delimiter string. Equivalent to :meth:`str.split`.

Parameters
----------
pat : str, optional
    String or regular expression to split on.
    If not specified, split on whitespace.
n : int, default -1 (all)
    Limit number of splits in output.
    ``None``, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
    Expand the splitted strings into separate columns.

    * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
    * If ``False``, return Series/Index, containing lists of strings.

Returns
-------
Series, Index, DataFrame or MultiIndex
    Type matches caller unless ``expand=True`` (see Notes).

See Also
--------
 Series.str.split : Split strings around given separator/delimiter.
 Series.str.rsplit : Splits string around given separator/delimiter,
 starting from the right.
 Series.str.join : Join lists contained as elements in the Series/Index
 with passed delimiter.
 str.split : Standard library version for split.
 str.rsplit : Standard library version for rsplit.

Notes
-----
The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`,  make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
  append `None` for padding up to `n` if ``expand=True``

If using ``expand=True``, Series and Index callers return DataFrame and
MultiIndex objects, respectively.

Examples
--------
>>> s = pd.Series(["this is a regular sentence",
"https://docs.python.org/3/tutorial/index.html", np.nan])

In the default setting, the string is split by whitespace.

>>> s.str.split()
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

Without the `n` parameter, the outputs of `rsplit` and `split`
are identical.

>>> s.str.rsplit()
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

The `n` parameter can be used to limit the number of splits on the
delimiter. The outputs of `split` and `rsplit` are different.

>>> s.str.split(n=2)
0                     [this, is, a regular sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

>>> s.str.rsplit(n=2)
0                     [this is a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

The `pat` parameter can be used to split by other characters.

>>> s.str.split(pat = "/")
0                         [this is a regular sentence]
1    [https:, , docs.python.org, 3, tutorial, index...
2                                                  NaN
dtype: object

When using ``expand=True``, the split elements will expand out into
separate columns. If NaN is present, it is propagated throughout
the columns during the split.

>>> s.str.split(expand=True)
                                               0     1     2        3
0                                           this    is     a  regular
1  https://docs.python.org/3/tutorial/index.html  None  None     None
2                                            NaN   NaN   NaN      NaN 
             4
0     sentence
1         None
2          NaN

For slightly more complex use cases like splitting the html document name
from a url, a combination of parameter settings can be used.

>>> s.str.rsplit("/", n=1, expand=True)
                                    0           1
0          this is a regular sentence        None
1  https://docs.python.org/3/tutorial  index.html
2                                 NaN         NaN

Remember to escape special characters when explicitly using regular
expressions.

>>> s = pd.Series(["1+1=2", np.nan])

>>> s.str.split("\+|=", expand=True)
     0    1    2
0    1    1    2
1  NaN  NaN  NaN

################################################################################
################################## Validation ##################################
################################################################################

3 Errors found:
	Examples do not pass tests
	flake8 error: E902 TokenError: EOF in multi-line statement
	flake8 error: E999 SyntaxError: invalid syntax

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 50, in pandas.Series.str.split
Failed example:
    s = pd.Series(["this is a regular sentence",
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[0]>", line 1
        s = pd.Series(["this is a regular sentence",
                                                   ^
    SyntaxError: unexpected EOF while parsing
**********************************************************************
Line 55, in pandas.Series.str.split
Failed example:
    s.str.split()
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[1]>", line 1, in <module>
        s.str.split()
    NameError: name 's' is not defined
**********************************************************************
Line 64, in pandas.Series.str.split
Failed example:
    s.str.rsplit()
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[2]>", line 1, in <module>
        s.str.rsplit()
    NameError: name 's' is not defined
**********************************************************************
Line 73, in pandas.Series.str.split
Failed example:
    s.str.split(n=2)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[3]>", line 1, in <module>
        s.str.split(n=2)
    NameError: name 's' is not defined
**********************************************************************
Line 79, in pandas.Series.str.split
Failed example:
    s.str.rsplit(n=2)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[4]>", line 1, in <module>
        s.str.rsplit(n=2)
    NameError: name 's' is not defined
**********************************************************************
Line 87, in pandas.Series.str.split
Failed example:
    s.str.split(pat = "/")
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[5]>", line 1, in <module>
        s.str.split(pat = "/")
    NameError: name 's' is not defined
**********************************************************************
Line 97, in pandas.Series.str.split
Failed example:
    s.str.split(expand=True)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[6]>", line 1, in <module>
        s.str.split(expand=True)
    NameError: name 's' is not defined
**********************************************************************
Line 110, in pandas.Series.str.split
Failed example:
    s.str.rsplit("/", n=1, expand=True)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[7]>", line 1, in <module>
        s.str.rsplit("/", n=1, expand=True)
    NameError: name 's' is not defined
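
One plausible reading of the failures above: the first example spans two lines, but the second line as printed has no `...` continuation prompt, so doctest treats only the first line as source. That line fails with a SyntaxError, and every later example then hits a NameError because `s` was never defined. The sketch below reproduces the parsing behaviour with the standard library only. It uses `list` instead of `pd.Series` so that it runs without pandas, and the `compiles` helper is hypothetical.

```python
import doctest

# Second line lacks the "... " prompt, as in the validation output above.
broken = '''
>>> s = list(["a",
"b"])
'''

# Same example with the continuation prompt in place.
fixed = '''
>>> s = list(["a",
... "b"])
'''

parser = doctest.DocTestParser()
broken_examples = parser.get_examples(broken)
fixed_examples = parser.get_examples(fixed)


def compiles(src):
    """Return True if src is a complete, valid interactive statement."""
    try:
        compile(src, "<example>", "single")
        return True
    except SyntaxError:
        return False


# Without "...", the source stops after the first line and the rest
# is treated as expected output.
print(repr(broken_examples[0].source))
print(repr(broken_examples[0].want))
print(compiles(broken_examples[0].source))  # False
print(compiles(fixed_examples[0].source))   # True
```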

pep8speaks · 2019-05-02T15:26:30Z

Hello @vandenn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-03 05:05:49 UTC

vandenn · 2019-05-02T15:41:20Z

Update: Fixed the PEP8 issues with the force-pushed commits.

WillAyd · 2019-05-02T15:57:13Z

pandas/core/strings.py

+
+    >>> s = pd.Series(["1+1=2", np.nan])
+
+    >>> s.str.split("\\+|=", expand=True)


WillAyd · May 2, 2019

Could you use a raw string here instead? That would be more idiomatic Python.

Suggested change

-    >>> s.str.split("\\+|=", expand=True)
+    >>> s.str.split(r"\+|=", expand=True)

vandenn · May 2, 2019

I tried doing it this way, but PyCharm's PEP8 checker still flags the \+ as an invalid escape sequence. Should I just force-push and see whether it passes the pep8speaks check?

vandenn · May 2, 2019

Update: I've force-pushed an edit that applies the changes I outlined above.
@WillAyd pep8speaks raised a W605 complaining about the invalid escape sequence. Should I keep these changes or revert to the "\\+|=" string?

WillAyd · May 2, 2019

Hmm OK, that's strange. Can you check the code base for how we've handled this elsewhere? I'd be surprised if this is the first time we've seen this.

vandenn · May 3, 2019

I turned the entire docstring into a raw string by adding an r before the opening triple quotation marks. This removed the PEP8 error both on my local machine and from pep8speaks. Kindly review. Thanks!
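
For context, a sketch of why the raw docstring resolves the W605 warning. A docstring is itself an ordinary string literal, so a `\+` written inside it is an escape sequence of the enclosing literal, even when the example shown within it uses an `r"..."` pattern. Prefixing the docstring with `r` keeps the backslash literal. The function below is hypothetical and uses `re.split` rather than pandas so that it runs with the standard library alone.

```python
import doctest


def split_example():
    r"""Hypothetical function whose docstring contains a regex example.

    >>> import re
    >>> re.split(r"\+|=", "1+1=2")
    ['1', '1', '2']
    """


# The doubled-backslash form and the raw form are the same string.
print("\\+|=" == r"\+|=")  # True

# The raw docstring keeps the backslash, so the doctest sees r"\+|=".
print(r"\+|=" in split_example.__doc__)  # True

# Run the docstring's examples to confirm that they pass.
parser = doctest.DocTestParser()
test = parser.get_doctest(split_example.__doc__, {}, "split_example", None, 0)
runner = doctest.DocTestRunner(verbose=False)
runner.run(test)
results = runner.summarize(verbose=False)
print(results.failed, results.attempted)  # 0 2
```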

codecov · 2019-05-02T16:06:22Z

Codecov Report

Merging #26267 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26267      +/-   ##
==========================================
- Coverage   91.98%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52386    52386              
==========================================
- Hits        48186    48182       -4     
- Misses       4200     4204       +4

Flag	Coverage Δ
#multiple	`90.52% <ø> (ø)`	⬆️
#single	`40.71% <ø> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.86% <ø> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d49ebd4...c6eac52. Read the comment docs.

codecov · 2019-05-02T16:06:32Z

Codecov Report

Merging #26267 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26267      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52379              
==========================================
- Hits        48184    48180       -4     
- Misses       4195     4199       +4

Flag	Coverage Δ
#multiple	`90.53% <100%> (ø)`	⬆️
#single	`40.72% <100%> (-0.15%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/strings.py	`98.86% <100%> (ø)`	⬆️
pandas/io/gbq.py	`78.94% <0%> (-10.53%)`	⬇️
pandas/core/frame.py	`96.9% <0%> (-0.12%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0989339...ceb29a8. Read the comment docs.

WillAyd

Minor nit but otherwise lgtm. @gfyoung care to take a look and merge if satisfied?

gfyoung · 2019-05-03T05:50:20Z

Thanks @vandenn !

vandenn · 2019-05-03T05:55:06Z

Thanks @WillAyd and @gfyoung !

Closes pandas-devgh-25296

vandenn force-pushed the add-regex-docs-to-split-str branch 2 times, most recently from a4c67a3 to c6eac52 Compare May 2, 2019 15:40

WillAyd requested changes May 2, 2019

WillAyd added the Docs label May 2, 2019

vandenn force-pushed the add-regex-docs-to-split-str branch 2 times, most recently from f07c0a7 to a0e3d8c Compare May 3, 2019 02:37

WillAyd requested changes May 3, 2019

DOC: Add regex example in str.split docstring

ceb29a8

vandenn force-pushed the add-regex-docs-to-split-str branch from a0e3d8c to ceb29a8 Compare May 3, 2019 05:05

WillAyd approved these changes May 3, 2019

WillAyd added this to the 0.25.0 milestone May 3, 2019

gfyoung merged commit e854ccf into pandas-dev:master May 3, 2019

vandenn deleted the add-regex-docs-to-split-str branch May 3, 2019 05:54

vandenn added a commit to vandenn/pandas that referenced this pull request May 3, 2019

DOC: Add regex example in str.split docstring (pandas-dev#26267)

e5db601

Closes pandas-devgh-25296

vandenn added a commit to vandenn/pandas that referenced this pull request May 3, 2019

DOC: Add regex example in str.split docstring (pandas-dev#26267) (#2)

79e3613

Closes pandas-devgh-25296
