DOC: Add regex example in str.split docstring #26267

Merged
gfyoung merged 1 commit into pandas-dev:master from the add-regex-docs-to-split-str branch on May 3, 2019

Conversation

@vandenn vandenn commented May 2, 2019

This PR adds a regex example to the str.split() docstring so that readers are aware of the need to escape special characters when passing a regular expression as the pattern.
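
As a rough illustration of why the escaping matters, here is a minimal sketch that uses only the standard-library `re` module rather than pandas. It assumes the pattern ends up in a regex engine in the same way, and the variable names are purely illustrative.

```python
import re

text = "1+1=2"

# Escaped: "\+" matches a literal plus sign, "=" matches a literal equals sign.
escaped_parts = re.split(r"\+|=", text)

# re.escape produces the escaped form of an arbitrary literal delimiter,
# which avoids having to remember which characters are special.
pattern = "|".join(re.escape(delimiter) for delimiter in ["+", "="])
built_parts = re.split(pattern, text)

# Unescaped: a leading "+" is a quantifier with nothing to repeat,
# so the pattern does not even compile.
try:
    re.compile("+|=")
    unescaped_error = None
except re.error as exc:
    unescaped_error = exc

print(escaped_parts)
print(built_parts)
print(unescaped_error)
```

Both split calls yield the three pieces `1`, `1` and `2`, while the unescaped pattern is rejected by the regex compiler.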

The docstring validation errors shown below already exist on master's HEAD without any modifications; the new docstring content introduces no additional errors.
Output of docstring validation:

$ python scripts/validate_docstrings.py pandas.Series.str.split

################################################################################
##################### Docstring (pandas.Series.str.split)  #####################
################################################################################

Split strings around given separator/delimiter.

Splits the string in the Series/Index from the beginning,
at the specified delimiter string. Equivalent to :meth:`str.split`.

Parameters
----------
pat : str, optional
    String or regular expression to split on.
    If not specified, split on whitespace.
n : int, default -1 (all)
    Limit number of splits in output.
    ``None``, 0 and -1 will be interpreted as return all splits.
expand : bool, default False
    Expand the splitted strings into separate columns.

    * If ``True``, return DataFrame/MultiIndex expanding dimensionality.
    * If ``False``, return Series/Index, containing lists of strings.

Returns
-------
Series, Index, DataFrame or MultiIndex
    Type matches caller unless ``expand=True`` (see Notes).

See Also
--------
 Series.str.split : Split strings around given separator/delimiter.
 Series.str.rsplit : Splits string around given separator/delimiter,
 starting from the right.
 Series.str.join : Join lists contained as elements in the Series/Index
 with passed delimiter.
 str.split : Standard library version for split.
 str.rsplit : Standard library version for rsplit.

Notes
-----
The handling of the `n` keyword depends on the number of found splits:

- If found splits > `n`,  make first `n` splits only
- If found splits <= `n`, make all splits
- If for a certain row the number of found splits < `n`,
  append `None` for padding up to `n` if ``expand=True``

If using ``expand=True``, Series and Index callers return DataFrame and
MultiIndex objects, respectively.

Examples
--------
>>> s = pd.Series(["this is a regular sentence",
"https://docs.python.org/3/tutorial/index.html", np.nan])

In the default setting, the string is split by whitespace.

>>> s.str.split()
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

Without the `n` parameter, the outputs of `rsplit` and `split`
are identical.

>>> s.str.rsplit()
0                   [this, is, a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

The `n` parameter can be used to limit the number of splits on the
delimiter. The outputs of `split` and `rsplit` are different.

>>> s.str.split(n=2)
0                     [this, is, a regular sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

>>> s.str.rsplit(n=2)
0                     [this is a, regular, sentence]
1    [https://docs.python.org/3/tutorial/index.html]
2                                                NaN
dtype: object

The `pat` parameter can be used to split by other characters.

>>> s.str.split(pat = "/")
0                         [this is a regular sentence]
1    [https:, , docs.python.org, 3, tutorial, index...
2                                                  NaN
dtype: object

When using ``expand=True``, the split elements will expand out into
separate columns. If NaN is present, it is propagated throughout
the columns during the split.

>>> s.str.split(expand=True)
                                               0     1     2        3
0                                           this    is     a  regular
1  https://docs.python.org/3/tutorial/index.html  None  None     None
2                                            NaN   NaN   NaN      NaN 
             4
0     sentence
1         None
2          NaN

For slightly more complex use cases like splitting the html document name
from a url, a combination of parameter settings can be used.

>>> s.str.rsplit("/", n=1, expand=True)
                                    0           1
0          this is a regular sentence        None
1  https://docs.python.org/3/tutorial  index.html
2                                 NaN         NaN

Remember to escape special characters when explicitly using regular
expressions.

>>> s = pd.Series(["1+1=2", np.nan])

>>> s.str.split("\+|=", expand=True)
     0    1    2
0    1    1    2
1  NaN  NaN  NaN

################################################################################
################################## Validation ##################################
################################################################################

3 Errors found:
	Examples do not pass tests
	flake8 error: E902 TokenError: EOF in multi-line statement
	flake8 error: E999 SyntaxError: invalid syntax

################################################################################
################################### Doctests ###################################
################################################################################

**********************************************************************
Line 50, in pandas.Series.str.split
Failed example:
    s = pd.Series(["this is a regular sentence",
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[0]>", line 1
        s = pd.Series(["this is a regular sentence",
                                                   ^
    SyntaxError: unexpected EOF while parsing
**********************************************************************
Line 55, in pandas.Series.str.split
Failed example:
    s.str.split()
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[1]>", line 1, in <module>
        s.str.split()
    NameError: name 's' is not defined
**********************************************************************
Line 64, in pandas.Series.str.split
Failed example:
    s.str.rsplit()
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[2]>", line 1, in <module>
        s.str.rsplit()
    NameError: name 's' is not defined
**********************************************************************
Line 73, in pandas.Series.str.split
Failed example:
    s.str.split(n=2)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[3]>", line 1, in <module>
        s.str.split(n=2)
    NameError: name 's' is not defined
**********************************************************************
Line 79, in pandas.Series.str.split
Failed example:
    s.str.rsplit(n=2)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[4]>", line 1, in <module>
        s.str.rsplit(n=2)
    NameError: name 's' is not defined
**********************************************************************
Line 87, in pandas.Series.str.split
Failed example:
    s.str.split(pat = "/")
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[5]>", line 1, in <module>
        s.str.split(pat = "/")
    NameError: name 's' is not defined
**********************************************************************
Line 97, in pandas.Series.str.split
Failed example:
    s.str.split(expand=True)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[6]>", line 1, in <module>
        s.str.split(expand=True)
    NameError: name 's' is not defined
**********************************************************************
Line 110, in pandas.Series.str.split
Failed example:
    s.str.rsplit("/", n=1, expand=True)
Exception raised:
    Traceback (most recent call last):
      File "/home/evan/anaconda3/envs/pandas/lib/python3.6/doctest.py", line 1330, in __run
        compileflags, 1), test.globs)
      File "<doctest pandas.Series.str.split[7]>", line 1, in <module>
        s.str.rsplit("/", n=1, expand=True)
    NameError: name 's' is not defined
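
The traceback pattern above (one SyntaxError followed by a run of NameErrors) is what doctest produces when a statement spans two lines and the second line lacks the `...` continuation prompt. The sketch below reproduces that behaviour with the standard-library `doctest` module only; it uses a plain list in place of `pd.Series`, and the `run` helper and the two sample docstrings are hypothetical names made up for this illustration.

```python
import doctest

# Second line has no "..." prompt, so doctest treats it as expected output.
BROKEN = '''
>>> items = ["this is a regular sentence",
"https://docs.python.org/3/tutorial/index.html"]
>>> len(items)
2
'''

# Same example with the continuation prompt in place.
FIXED = '''
>>> items = ["this is a regular sentence",
...          "https://docs.python.org/3/tutorial/index.html"]
>>> len(items)
2
'''


def run(docstring, name):
    """Run the examples in a docstring and return (failed, attempted)."""
    test = doctest.DocTestParser().get_doctest(docstring, {}, name, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    # Swallow the failure report; only the counts matter here.
    results = runner.run(test, out=lambda text: None)
    return results.failed, results.attempted


broken_result = run(BROKEN, "broken")
fixed_result = run(FIXED, "fixed")
print(broken_result, fixed_result)
```

In the broken variant the first example fails to compile and the second fails because `items` was never defined, which mirrors the cascade in the log.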


pep8speaks commented May 2, 2019

Hello @vandenn! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-05-03 05:05:49 UTC

@vandenn vandenn force-pushed the add-regex-docs-to-split-str branch 2 times, most recently from a4c67a3 to c6eac52 Compare May 2, 2019 15:40

vandenn commented May 2, 2019

Update: Fixed the PEP8 issues with the force-pushed commits.


>>> s = pd.Series(["1+1=2", np.nan])

>>> s.str.split("\\+|=", expand=True)
Member

Could you use a raw string here instead? That would be more idiomatic Python.

Contributor Author

Suggested change
- >>> s.str.split("\\+|=", expand=True)
+ >>> s.str.split(r"\+|=", expand=True)

I tried doing it this way but PyCharm's PEP8 checker still detects the \+ as an invalid escape sequence. Should I just try force-pushing and checking if it will pass the pep8speaks check?
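
For reference, the two literals in the suggestion denote the same four-character pattern when each is written as a standalone Python string. The following sketch checks that with the standard library only; the variable names are illustrative.

```python
import re

double_backslash = "\\+|="  # ordinary literal, backslash escaped by hand
raw_literal = r"\+|="       # raw literal, backslash kept exactly as typed

# Both spell the characters: backslash, plus, pipe, equals.
same_pattern = double_backslash == raw_literal
pattern_length = len(raw_literal)

parts = re.split(raw_literal, "1+1=2")
print(same_pattern, pattern_length, parts)
```

The raw form is usually easier to read because the pattern appears in the source exactly as the regex engine receives it.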

Contributor Author

Update: I've force-pushed an edit that applies the changes I've outlined above.
@WillAyd pep8speaks raised a W605 about the invalid escape sequence. Should I keep these changes or revert to the "\\+|=" string?

Member

Hmm OK, that's strange. Can you check the code base for how we've handled this elsewhere? I'd be surprised if this is the first time we've seen this.

Contributor Author

I turned the entire docstring into a raw string by adding an r before the opening triple quotation marks. This removed the PEP8 error both on my local machine and in pep8speaks. Kindly review. Thanks!
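
A plausible reading of the W605 report is that the `r` prefix inside the example is just text within the docstring, so the backslash is still processed by the enclosing (non-raw) docstring literal. The sketch below compiles two hypothetical function sources that differ only in the docstring prefix and records the warnings Python emits; the function `f`, the helper `compile_and_collect` and the source strings are all made up for this illustration. It assumes a Python version that still reports invalid escape sequences as a warning rather than an error.

```python
import warnings

# Source of a function whose ordinary docstring contains r"\+|=".
PLAIN_SOURCE = (
    'def f():\n'
    '    """\n'
    '    >>> s.str.split(r"\\+|=", expand=True)\n'
    '    """\n'
)
# Same source, but with the docstring itself marked as raw.
RAW_SOURCE = PLAIN_SOURCE.replace('"""', 'r"""', 1)


def compile_and_collect(source):
    """Compile and run the source; return the docstring of f and any warnings."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        namespace = {}
        exec(compile(source, "<sketch>", "exec"), namespace)
    return namespace["f"].__doc__, [str(w.message) for w in caught]


plain_doc, plain_warnings = compile_and_collect(PLAIN_SOURCE)
raw_doc, raw_warnings = compile_and_collect(RAW_SOURCE)

print(plain_warnings)
print(raw_warnings)
print(plain_doc == raw_doc)
```

The resulting docstring text is the same either way, because an unrecognized escape keeps its backslash; the raw prefix only removes the warning.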

@WillAyd WillAyd added the Docs label May 2, 2019

codecov bot commented May 2, 2019

Codecov Report

Merging #26267 into master will decrease coverage by <.01%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #26267      +/-   ##
==========================================
- Coverage   91.98%   91.97%   -0.01%     
==========================================
  Files         175      175              
  Lines       52386    52386              
==========================================
- Hits        48186    48182       -4     
- Misses       4200     4204       +4
Flag        Coverage Δ
#multiple   90.52% <ø> (ø) ⬆️
#single     40.71% <ø> (-0.15%) ⬇️

Impacted Files           Coverage Δ
pandas/core/strings.py   98.86% <ø> (ø) ⬆️
pandas/io/gbq.py         78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py     96.9% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d49ebd4...c6eac52. Read the comment docs.


codecov bot commented May 2, 2019

Codecov Report

Merging #26267 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #26267      +/-   ##
==========================================
- Coverage   91.99%   91.98%   -0.01%     
==========================================
  Files         175      175              
  Lines       52379    52379              
==========================================
- Hits        48184    48180       -4     
- Misses       4195     4199       +4
Flag        Coverage Δ
#multiple   90.53% <100%> (ø) ⬆️
#single     40.72% <100%> (-0.15%) ⬇️

Impacted Files           Coverage Δ
pandas/core/strings.py   98.86% <100%> (ø) ⬆️
pandas/io/gbq.py         78.94% <0%> (-10.53%) ⬇️
pandas/core/frame.py     96.9% <0%> (-0.12%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0989339...ceb29a8. Read the comment docs.

@vandenn vandenn force-pushed the add-regex-docs-to-split-str branch 2 times, most recently from f07c0a7 to a0e3d8c Compare May 3, 2019 02:37

@WillAyd WillAyd left a comment

Minor nit but otherwise lgtm. @gfyoung care to take a look and merge if satisfied?

@vandenn vandenn force-pushed the add-regex-docs-to-split-str branch from a0e3d8c to ceb29a8 Compare May 3, 2019 05:05
@WillAyd WillAyd added this to the 0.25.0 milestone May 3, 2019
@gfyoung gfyoung merged commit e854ccf into pandas-dev:master May 3, 2019

gfyoung commented May 3, 2019

Thanks @vandenn !

@vandenn vandenn deleted the add-regex-docs-to-split-str branch May 3, 2019 05:54

vandenn commented May 3, 2019

Thanks @WillAyd and @gfyoung !

vandenn added a commit to vandenn/pandas that referenced this pull request May 3, 2019

Successfully merging this pull request may close these issues.

Document Using Regex for str.split
4 participants