Validation is not strict enough #213

stefanv · 2019-04-24T04:18:05Z

The docs suggest that python -m numpydoc can be used to validate docstrings. However, numpydoc reports only the most egregious breakages of the standard.

If we want to help users write high quality docstrings, we should provide a stronger checker.

In the following example, only the incorrect header gets caught:

def foo(x):
    """The first line should be a one line description, not a multiple
    paragraph sentence.

    References
    ----------
    These are out of place and incorrectly formatted.

    Some Heading
    ------------
    Unknown heading

    Parameters
    ----------
    x :
    I should specify a type for this parameter

    """
    return x

The text was updated successfully, but these errors were encountered:

jnothman · 2019-04-24T04:30:38Z

Is there a way to run a restructured text or sphinx linter over it?

stefanv · 2019-04-24T05:20:08Z

Easiest is probably to build s small sphinx project and compile it. I haven't had much luck using Sphinx as a library, but maybe in 2.0 it's easier.

anntzer · 2019-05-01T09:45:05Z

FWIW, I have a small wrapper at https://github.com/anntzer/speedoc that's intended to convert docstrings to man pages on the fly (for use à la man foo), and takes care of setting up a throwaway sphinx project behind the scenes. May be useful for this use case.
(right now only master is compatible with sphinx 2.0.)

stefanv · 2019-05-01T21:50:02Z

@anntzer Very nice!

calincru · 2019-06-19T12:45:38Z

Double that.

I stumbled upon a similar issue by following the dependency chain YCM -> ycmd -> jedi -> numpydoc: jedi tries to extract information about parameters and return values from docstrings. But even if the Parameters section exists, numpydoc raises an exception if, for instance, the Examples sections appears twice -- I got here by using networkx in vim with YCM which has this exact problem in its Graph class (1, 2).

IMO the solution would be to separate semantic analysis from parsing. Another quicker but hackier solution would be to throw the exceptions lazily, once an erroneous section is accessed (e.g., when trying to access Returns or Yields when both of them are specified -- which is a semantic error, or when trying to access Examples when it appears twice, etc).

datapythonista · 2019-07-16T10:29:23Z

In pandas we implemented stricter validation of docstrings. We defined a set of standards that extend numpydoc, and we enforce them in the CI (gradually, since fixing all the docstrings is a huge amount of work).

We mainly have a script that can be used to validate a single docstring and provide detailed information on errors, can be used to validate all the docs (for the CI), and also can generate a JSON report with the errors found.

The list of errors we validate can be found here:
https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py#L76

Some examples of things we validate:

Paragraphs in the summary must start with a capital letter and end with a period.
Parameters must have a type, str, int must be used instead of string, integer...
Parameters description must start with a capital letter and end with a period.
The order of the sections must be correct, and no unknown sections should be found
Examples must be valid doctests, and respect pep8
...

We currently have few things that are pandas specific, but it shouldn't be difficult to make the script 100% generic and move it to numpydoc. From our side we agree that the script could be moved to numpydoc if there is agreement with the rest of the community [1]. The main question is whether there is interest outside of pandas in a stricter standard, and if there can be agreement on what the standard should be.

DEPS: make a package for the ci/code_checks.sh pandas-dev/pandas#27342

larsoner · 2019-07-16T11:18:00Z

The main question is whether there is interest outside of pandas in a stricter standard

Yes from me, I've written and used validators multiple times, mostly for ensuring that all parameters are documented (in proper order).

and if there can be agreement on what the standard should be.

I think as long as the checks can be turned off via options, like you can with PEP8, we can have all the checks listed above.

For example, I have a fairly large codebase that I think will respect all of the above except:

Parameters description must start with a capital letter and end with a period.

Eventually I'd probably/maybe want it to, but not at first.

larsoner · 2019-07-16T11:22:59Z

(See also #13 where this was originally brought up five years ago!)

jnothman · 2019-07-25T11:35:55Z

I'd be very happy to see this tooling centralised and it makes sense to include it within numpydoc. @datapythonista do you intend to submit a PR?

rth · 2019-10-07T15:39:46Z

I'd be very happy to see this tooling centralised and it makes sense to include it within numpydoc.

Really looking forward to having some form of more advanced docstring validation script as part of numpydoc.

If you need some help with that @datapythonista please let us know. Otherwise maybe someone from the pandas team would also be interested/available in contributing it?

datapythonista · 2019-10-07T16:22:11Z

Yes, for what I know looks like there is agreement on moving our script to numpydoc.

I don't know much how numpydoc works internally, and I don't have time now to check in detail. Does someone have an opinion on whether we should move our code as-is, removing all pandas specific stuff. Or if it makes more sense to integrate it with the rest of numpydoc?

rth · 2019-10-07T16:31:25Z

I'm not too familiar with numpydoc either but I imagine that backward compatibility on python -m numpydoc would still need to be preserved. So maybe add an option here that would default to current validation but can be changed to the validation code from pandas and tag it as experimental?

Also I think running the doctests from the pandas validation script could be dropped, since pytest does that fine.

timhoffm · 2019-10-11T23:43:04Z

There's also pydocstyle, which can be used:

standalone https://github.com/PyCQA/pydocstyle
or as flake8 extension via https://gitlab.com/pycqa/flake8-docstrings

In matplotlib CI, we are currently using the latter, which allows per-file-ignores. - Not all .py files are created equally. For example, our .py files for sphinx-gallery examples follow different rules.

From a quick look, pydocstyle and the pandas checker have partial overlap, but each has functionality the other has not. It might be worth investigating if merging these two is beneficial.

larsoner · 2019-10-11T23:55:38Z

I use pydocstyle in some projects, and it is useful. However, being a static analysis tool it cannot for example parse whether the Parameters section has all elements listed when the docstring is generated/populated at import time. This is problematic for example in some scipy submodules (stats and ndimage I think?) where names are populated by a @docfiller decorator.

So even though there is some overlap, I suspect that the pandas solution should offer complementary functionality, at least in these cases. And merging them IMO is not really an option, since the dynamic checker does things that by definition the static checker can't.

larsoner · 2019-10-12T00:00:10Z

See also PyCQA/pydocstyle#185 for discussion of this static/dynamic difference

datapythonista · 2019-10-12T19:24:29Z

I discussed with @Nurdok about adding the pandas validations to pydocstyle and he's happy with the idea. I wasn't aware pydocstyle doesn't import the code, but that's surely a must for us. I'll double check, but, if adding that to pydocstyle is not an option, I think we'll move forward moving the pandas validations to numpydoc, and adding any that we don't have and we need.

Nurdok · 2019-10-20T09:45:37Z

Hi @datapythonista,

pydocstyle is a static linter and I don't see us adding code import any time soon. I'm sorry if that's problematic for you. Feel free to add any static checks you need to pydocstyle if you'd like.

datapythonista · 2019-10-20T22:05:45Z

Thanks for the info @Nurdok. We'll see what makes sense, but I think it makes sense to start by moving the pandas validation to numpydoc. As I said, at least for pandas obtaining the docstrings in a static way means that we wouldn't be able to validate even 50% of them.

datapythonista · 2019-10-25T18:18:23Z

I was having a look, and see that python -m numpydoc does not actually validate the docstring, but renders it as restructured text instead.

I'm not sure how we want to run the validation. I can think of few options:

Replace the rendering by validating the object and printing the list of errors
Use a --validate and have boh behaviors
Show the rendering and next show the errors
Use a script instead of running the module for the validation

Is people using python -m numpydoc to see the rendered version? Or was implemented just to see if it breaks? I think the first option in the list is the cleaner one, but not sure if this is being used, or if it makes sense to change the behavior

larsoner · 2019-10-25T18:24:33Z

Is people using python -m numpydoc to see the rendered version? Or was implemented just to see if it breaks? I think the first option in the list is the cleaner one, but not sure if this is being used, or if it makes sense to change the behavior

I'm not sure either. But there is little cost to keeping backward compat here, I would go for the second option (--validate to enable validation mode).

rth mentioned this issue Oct 7, 2019

validate_docstrings.py as a separate package? pandas-dev/pandas#28822

Closed

This was referenced Oct 7, 2019

add example of a good docstring for defaults and examples scikit-learn/scikit-learn#12356

Merged

Ensure all attributes are documented scikit-learn/scikit-learn#14312

Closed

datapythonista mentioned this issue Oct 21, 2019

Add docstring validation script (from pandas) #238

Merged

datapythonista mentioned this issue Oct 25, 2019

Adding --validate option __main__ and run new validation #240

Merged

larsoner closed this as completed in #240 Nov 13, 2019

simonjayhawkins mentioned this issue Feb 11, 2020

CLN: D300 Use """triple double quotes""" pandas-dev/pandas#31888

Merged

rprimet mentioned this issue Sep 22, 2021

Add numpydoc checks to CI jobs alphacsc/alphacsc#40

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Validation is not strict enough #213

Validation is not strict enough #213

stefanv commented Apr 24, 2019

jnothman commented Apr 24, 2019 via email

stefanv commented Apr 24, 2019

anntzer commented May 1, 2019

stefanv commented May 1, 2019

calincru commented Jun 19, 2019 •

edited

Loading

datapythonista commented Jul 16, 2019

larsoner commented Jul 16, 2019

larsoner commented Jul 16, 2019

jnothman commented Jul 25, 2019

rth commented Oct 7, 2019

datapythonista commented Oct 7, 2019

rth commented Oct 7, 2019 •

edited

Loading

timhoffm commented Oct 11, 2019 •

edited

Loading

larsoner commented Oct 11, 2019

larsoner commented Oct 12, 2019

datapythonista commented Oct 12, 2019

Nurdok commented Oct 20, 2019

datapythonista commented Oct 20, 2019

datapythonista commented Oct 25, 2019

larsoner commented Oct 25, 2019

Validation is not strict enough #213

Validation is not strict enough #213

Comments

stefanv commented Apr 24, 2019

jnothman commented Apr 24, 2019 via email

stefanv commented Apr 24, 2019

anntzer commented May 1, 2019

stefanv commented May 1, 2019

calincru commented Jun 19, 2019 • edited Loading

datapythonista commented Jul 16, 2019

larsoner commented Jul 16, 2019

larsoner commented Jul 16, 2019

jnothman commented Jul 25, 2019

rth commented Oct 7, 2019

datapythonista commented Oct 7, 2019

rth commented Oct 7, 2019 • edited Loading

timhoffm commented Oct 11, 2019 • edited Loading

larsoner commented Oct 11, 2019

larsoner commented Oct 12, 2019

datapythonista commented Oct 12, 2019

Nurdok commented Oct 20, 2019

datapythonista commented Oct 20, 2019

datapythonista commented Oct 25, 2019

larsoner commented Oct 25, 2019

calincru commented Jun 19, 2019 •

edited

Loading

rth commented Oct 7, 2019 •

edited

Loading

timhoffm commented Oct 11, 2019 •

edited

Loading