Add generic hook for validating all known schemas #33

ssbarnea · 2022-02-04T12:00:20Z

By looking at https://github.com/sirosen/check-jsonschema/blob/main/.pre-commit-hooks.yaml I realise that the currently provided set of hooks does not scale well on real-life repositories, where you might easily have more than 10+ files using schemas.

I would like to suggest a generic option that would detect all files that match known schemas and validate them, with an option to fallback to offline more, so it would also work in locked down environments like https://pre-commit.ci

As far users follow official file naming patterns, it should not be a problem to do that and we might avoid having to add tons of hook definitions and to have to maintain them.

The current set of builtin schemas is already good but we can easily add more in time.

I may worth remarking that the first hook definition from https://github.com/sirosen/check-jsonschema/blob/main/.pre-commit-hooks.yaml is broken as it would not work when used bare and thow an error for each encountered file:

check-jsonschema: error: Either --schemafile or --builtin-schema must be provided

IMHO, what I describe here would allow this first hook to be used without having to specify any file pattern or schemas to be used.

While testing --builtin-schema I encountered two unexpected errors:

Error: builtin schema could not be loaded, no such schema
NoSuchSchemaError: no builtin schema named playbook.yml was found

My expectation was that a file that does not match a schema association pattern should just not be validated, aka skipped.

The text was updated successfully, but these errors were encountered:

sirosen · 2022-02-04T13:36:07Z

the currently provided set of hooks does not scale well on real-life repositories, where you might easily have more than 10+ files using schemas.

I'm not sure I follow. Why does it not scale well? If you were to use all of the currently supported hooks (other than the generic one) in a single config, it would be:

- repo: https://github.com/sirosen/check-jsonschema
  rev: 0.10.2
  hooks:
    - id: check-github-workflows
    - id: check-github-actions
    - id: check-azure-pipelines
    - id: check-travis
    - id: check-readthedocs
    - id: check-renovate

Somehow the use case you have in mind doesn't match my expectations, since you say "10+ files" and I've made a list of only 6 hooks here.

the first hook definition from https://github.com/sirosen/check-jsonschema/blob/main/.pre-commit-hooks.yaml is broken as it would not work when used bare and thow an error for each encountered file:

This is actually on purpose, but perhaps needs a better explanation. There's example usage in the README:
https://github.com/sirosen/check-jsonschema#applying-an-arbitrary-schema-to-files

The check-jsonschema hook allows users to have a local schema + a set of files to which it applies. e.g. data/schema.json + data/files/foo.json and enforce that data/files/*.json must validate under data/schema.json

If there's a way that this can be made clearer, let's improve it. Or if there's a better way you see of supporting this kind of usage, I'm open to changing the behavior!

While testing --builtin-schema I encountered two unexpected errors:

I think this is a misunderstanding about how that option works. It takes a parameter, which is the name of the schema to use.
The supported schemas are the ones listed in the README: https://github.com/sirosen/check-jsonschema#--builtin-schema

(Aside: I'd like to switch this to finding the names at runtime, so they can be shown in helptext as choices, but I haven't bothered to do the tinkering to get that working yet.)

I sort of understand the desire for a check-jsonschema-all hook, but I want to avoid recreating the whole schemastore catalog.

The python code "doesn't know" anything today about which schema goes with which files. That information is only encoded here in the hooks. I'm hesitant to change that because it keeps the python so much simpler. Right now there's no code for handling globbing or regexes for file matching: I'm able to rely on pre-commit to provide all of that.

If there's a behavior which can be implemented without recreating too much of either schemastore or pre-commit, I need more clarity/information before I can commit to anything.

ssbarnea · 2022-02-04T14:51:32Z

I do have some ideas about how this can be achieved and likely it would involve creating another python project that encapsulates most of schemastore for offline use, including file matching patterns.

Regarding filematching patterns it happens that I already have something similar which is using wcmatch to determine file types at https://github.com/ansible-community/ansible-lint/blob/0837dce8466331dd8e3996dd0785348756fd4370/src/ansiblelint/config.py#L42 -- that list is even user overridable so they can add they own if needed.

Let me know if you find this idea interesting as I may be able to spend few hours over the weekend and produce that behemoth python package. That would also enable offline usage.

sirosen · 2022-02-04T17:23:06Z

If you are interested in working on a pip-installable copy of schemastore, I would be interested in using it.

I still have concerns (mostly performance) about a behavior which reads the whole catalog. I don't want to promise that I would include such a feature, at least not in the near term.
However, having a trustworthy schemastore package would allow me to remove most of the files managed under the vendor-schemas toolchain that I've built. I think I will still need it for the Microsoft and ReadTheDocs schemas.

I would need to have a high level of confidence that we can keep the packaged copy of the schemas up to date.

This may also be worth raising with the SchemaStore maintainers. They're very open to contributions to the schemas themselves, so perhaps they'd be willing to host github.com/SchemaStore/py-schemastore.

sirosen · 2022-02-19T22:52:29Z

In the weeks since this was first discussed, I've added a daily job to check the vendored schemas for updates. It seems to be working nicely. At this point, I'm not really likely to add a dependency on another package to provide the schemas unless it gets some significant traction. I figure I should mention this before anyone puts in significant effort on this front.

ssbarnea · 2022-04-09T06:34:04Z

How about adding a config file with mappings, maybe even compatible with https://github.com/SchemaStore/schemastore/blob/master/src/api/json/catalog.json -- that should allow you to make it work like a linter, where you can feed it json files w/o having to mention their schemas, as that would be determined based on config.

PS. I did not had time to progress on bundling schemastore, especially as I hit some problems like python-jsonschema/jsonschema#931 which hindered my progression.

sirosen · 2022-04-09T16:00:05Z

Right now, this issue is on a course towards being closed as a wontfix. I need information to justify doing work on it, or enough information to know that the use-case is satisfied well enough by other features.

The primary issue is that I still do not understand what the use-case is and why a list of hooks like

- repo: https://github.com/python-jsonschema/check-jsonschema
  rev: 0.14.2
  hooks:
    - id: check-github-workflows
    - id: check-readthedocs
    - id: check-travis
    - ...

isn't sufficient or this use-case.

Can you articulate why that would not work for pre-commit usage, or what the CLI usage is that you would like to see which isn't supported today?

ssbarnea · 2022-05-15T09:25:49Z

You might want to know that yesterday I released first version of schemastore python bundle package, with over 500 schemas in it. That could be used to validate schemas offline.

sirosen · 2022-06-21T16:18:30Z

Absent the information necessary to justify accepting this issue, I'm closing as a wontfix.

sirosen added the enhancement New feature or request label Feb 4, 2022

sirosen added the wontfix This will not be worked on label Jun 21, 2022

sirosen closed this as not planned Won't fix, can't repro, duplicate, stale Jun 21, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add generic hook for validating all known schemas #33

Add generic hook for validating all known schemas #33

ssbarnea commented Feb 4, 2022 •

edited

Loading

sirosen commented Feb 4, 2022

ssbarnea commented Feb 4, 2022

sirosen commented Feb 4, 2022

sirosen commented Feb 19, 2022

ssbarnea commented Apr 9, 2022

sirosen commented Apr 9, 2022

ssbarnea commented May 15, 2022

sirosen commented Jun 21, 2022

Add generic hook for validating all known schemas #33

Add generic hook for validating all known schemas #33

Comments

ssbarnea commented Feb 4, 2022 • edited Loading

sirosen commented Feb 4, 2022

ssbarnea commented Feb 4, 2022

sirosen commented Feb 4, 2022

sirosen commented Feb 19, 2022

ssbarnea commented Apr 9, 2022

sirosen commented Apr 9, 2022

ssbarnea commented May 15, 2022

sirosen commented Jun 21, 2022

ssbarnea commented Feb 4, 2022 •

edited

Loading