Skip to content

Validation is not strict enough #213

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
stefanv opened this issue Apr 24, 2019 · 20 comments · Fixed by #240
Closed

Validation is not strict enough #213

stefanv opened this issue Apr 24, 2019 · 20 comments · Fixed by #240

Comments

@stefanv
Copy link
Contributor

stefanv commented Apr 24, 2019

The docs suggest that python -m numpydoc can be used to validate docstrings. However, numpydoc reports only the most egregious breakages of the standard.

If we want to help users write high quality docstrings, we should provide a stronger checker.

In the following example, only the incorrect header gets caught:

def foo(x):
    """The first line should be a one line description, not a multiple
    paragraph sentence.

    References
    ----------
    These are out of place and incorrectly formatted.

    Some Heading
    ------------
    Unknown heading

    Parameters
    ----------
    x :
    I should specify a type for this parameter

    """
    return x
@jnothman
Copy link
Member

jnothman commented Apr 24, 2019 via email

@stefanv
Copy link
Contributor Author

stefanv commented Apr 24, 2019

Easiest is probably to build s small sphinx project and compile it. I haven't had much luck using Sphinx as a library, but maybe in 2.0 it's easier.

@anntzer
Copy link
Contributor

anntzer commented May 1, 2019

FWIW, I have a small wrapper at https://github.com/anntzer/speedoc that's intended to convert docstrings to man pages on the fly (for use à la man foo), and takes care of setting up a throwaway sphinx project behind the scenes. May be useful for this use case.
(right now only master is compatible with sphinx 2.0.)

@stefanv
Copy link
Contributor Author

stefanv commented May 1, 2019

@anntzer Very nice!

@calincru
Copy link

calincru commented Jun 19, 2019

Double that.

I stumbled upon a similar issue by following the dependency chain YCM -> ycmd -> jedi -> numpydoc: jedi tries to extract information about parameters and return values from docstrings. But even if the Parameters section exists, numpydoc raises an exception if, for instance, the Examples sections appears twice -- I got here by using networkx in vim with YCM which has this exact problem in its Graph class (1, 2).

IMO the solution would be to separate semantic analysis from parsing. Another quicker but hackier solution would be to throw the exceptions lazily, once an erroneous section is accessed (e.g., when trying to access Returns or Yields when both of them are specified -- which is a semantic error, or when trying to access Examples when it appears twice, etc).

@datapythonista
Copy link
Contributor

In pandas we implemented stricter validation of docstrings. We defined a set of standards that extend numpydoc, and we enforce them in the CI (gradually, since fixing all the docstrings is a huge amount of work).

We mainly have a script that can be used to validate a single docstring and provide detailed information on errors, can be used to validate all the docs (for the CI), and also can generate a JSON report with the errors found.

The list of errors we validate can be found here:
https://github.com/pandas-dev/pandas/blob/master/scripts/validate_docstrings.py#L76

Some examples of things we validate:

  • Paragraphs in the summary must start with a capital letter and end with a period.
  • Parameters must have a type, str, int must be used instead of string, integer...
  • Parameters description must start with a capital letter and end with a period.
  • The order of the sections must be correct, and no unknown sections should be found
  • Examples must be valid doctests, and respect pep8
  • ...

We currently have few things that are pandas specific, but it shouldn't be difficult to make the script 100% generic and move it to numpydoc. From our side we agree that the script could be moved to numpydoc if there is agreement with the rest of the community [1]. The main question is whether there is interest outside of pandas in a stricter standard, and if there can be agreement on what the standard should be.

  1. DEPS: make a package for the ci/code_checks.sh pandas-dev/pandas#27342

@larsoner
Copy link
Collaborator

The main question is whether there is interest outside of pandas in a stricter standard

Yes from me, I've written and used validators multiple times, mostly for ensuring that all parameters are documented (in proper order).

and if there can be agreement on what the standard should be.

I think as long as the checks can be turned off via options, like you can with PEP8, we can have all the checks listed above.

For example, I have a fairly large codebase that I think will respect all of the above except:

Parameters description must start with a capital letter and end with a period.

Eventually I'd probably/maybe want it to, but not at first.

@larsoner
Copy link
Collaborator

(See also #13 where this was originally brought up five years ago!)

@jnothman
Copy link
Member

I'd be very happy to see this tooling centralised and it makes sense to include it within numpydoc. @datapythonista do you intend to submit a PR?

@rth
Copy link
Contributor

rth commented Oct 7, 2019

I'd be very happy to see this tooling centralised and it makes sense to include it within numpydoc.

Really looking forward to having some form of more advanced docstring validation script as part of numpydoc.

If you need some help with that @datapythonista please let us know. Otherwise maybe someone from the pandas team would also be interested/available in contributing it?

@datapythonista
Copy link
Contributor

Yes, for what I know looks like there is agreement on moving our script to numpydoc.

I don't know much how numpydoc works internally, and I don't have time now to check in detail. Does someone have an opinion on whether we should move our code as-is, removing all pandas specific stuff. Or if it makes more sense to integrate it with the rest of numpydoc?

@rth
Copy link
Contributor

rth commented Oct 7, 2019

I'm not too familiar with numpydoc either but I imagine that backward compatibility on python -m numpydoc would still need to be preserved. So maybe add an option here that would default to current validation but can be changed to the validation code from pandas and tag it as experimental?

Also I think running the doctests from the pandas validation script could be dropped, since pytest does that fine.

@timhoffm
Copy link
Contributor

timhoffm commented Oct 11, 2019

There's also pydocstyle, which can be used:

In matplotlib CI, we are currently using the latter, which allows per-file-ignores. - Not all .py files are created equally. For example, our .py files for sphinx-gallery examples follow different rules.

From a quick look, pydocstyle and the pandas checker have partial overlap, but each has functionality the other has not. It might be worth investigating if merging these two is beneficial.

@larsoner
Copy link
Collaborator

I use pydocstyle in some projects, and it is useful. However, being a static analysis tool it cannot for example parse whether the Parameters section has all elements listed when the docstring is generated/populated at import time. This is problematic for example in some scipy submodules (stats and ndimage I think?) where names are populated by a @docfiller decorator.

So even though there is some overlap, I suspect that the pandas solution should offer complementary functionality, at least in these cases. And merging them IMO is not really an option, since the dynamic checker does things that by definition the static checker can't.

@larsoner
Copy link
Collaborator

See also PyCQA/pydocstyle#185 for discussion of this static/dynamic difference

@datapythonista
Copy link
Contributor

I discussed with @Nurdok about adding the pandas validations to pydocstyle and he's happy with the idea. I wasn't aware pydocstyle doesn't import the code, but that's surely a must for us. I'll double check, but, if adding that to pydocstyle is not an option, I think we'll move forward moving the pandas validations to numpydoc, and adding any that we don't have and we need.

@Nurdok
Copy link

Nurdok commented Oct 20, 2019

Hi @datapythonista,

pydocstyle is a static linter and I don't see us adding code import any time soon. I'm sorry if that's problematic for you. Feel free to add any static checks you need to pydocstyle if you'd like.

@datapythonista
Copy link
Contributor

Thanks for the info @Nurdok. We'll see what makes sense, but I think it makes sense to start by moving the pandas validation to numpydoc. As I said, at least for pandas obtaining the docstrings in a static way means that we wouldn't be able to validate even 50% of them.

@datapythonista
Copy link
Contributor

I was having a look, and see that python -m numpydoc does not actually validate the docstring, but renders it as restructured text instead.

I'm not sure how we want to run the validation. I can think of few options:

  • Replace the rendering by validating the object and printing the list of errors
  • Use a --validate and have boh behaviors
  • Show the rendering and next show the errors
  • Use a script instead of running the module for the validation

Is people using python -m numpydoc to see the rendered version? Or was implemented just to see if it breaks? I think the first option in the list is the cleaner one, but not sure if this is being used, or if it makes sense to change the behavior

@larsoner
Copy link
Collaborator

Is people using python -m numpydoc to see the rendered version? Or was implemented just to see if it breaks? I think the first option in the list is the cleaner one, but not sure if this is being used, or if it makes sense to change the behavior

I'm not sure either. But there is little cost to keeping backward compat here, I would go for the second option (--validate to enable validation mode).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

9 participants