Skip to content

Osquery runtime refactoring, bugfixes and stability improvements #176

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Oct 17, 2017
Merged

Osquery runtime refactoring, bugfixes and stability improvements #176

merged 7 commits into from
Oct 17, 2017

Conversation

zwass
Copy link
Contributor

@zwass zwass commented Oct 13, 2017

  • Changes to public API to better reflect actual usage and ease implementation.
  • Use errgroup for coordination of process management/cleanup. This helps
    prevent leaking of goroutines (relative to existing implementation).
  • Fix bug in which osquery process was not restarted after failure.
  • Allow logger to be set properly.
  • Add logging around recovery scenarios.
  • Check communication with both osquery and extension server in health check
    (previously only the extension server was checked).
  • Add healthcheck on interval that initiates recovery on failure (Closes Implement osqueryd health check #141).
  • Do not set cmd output to ioutil.Discard. Causes a bug with cmd.Wait (see
    os/exec: Inconsistent behaviour in exec.Cmd.Wait golang/go#20730).


// Error case
err := r.instance.eg.Wait()
level.Error(r.instance.logger).Log(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is going to be controversial, but let's keep the logs as level.Info with err as a message. field

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we were doing:

level.Info(logger).Log(
  "msg", "this is the message about what went wrong",
  "err", err, // this may also be a wrapped error. if so, you can leave off msg.
)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can I clarify... Are you both suggesting the same thing, or is there some difference?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're suggesting the same thing.

@zwass zwass changed the title Refactor osquery runtime Osquery runtime refactoring, bugfixes and stability improvements Oct 13, 2017
@@ -147,7 +147,7 @@ func main() {
defer ext.Shutdown()

// Start the osqueryd instance
instance, err := osquery.LaunchOsqueryInstance(
runner := osquery.NewRunner(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm interested as to why you broke up LaunchOsqueryInstance into this two step process of creating a runner and explicitly starting it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a bit more natural to me, and is patterned after https://golang.org/pkg/net/http/#Server. I don't think it has to be this way.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be LaunchOsqueryInstance for a few reasons.

  • There is no time in the codebase where you start a runner and don't immediately start it.
  • It is not possible to create a runner, pass it to two functions, and have them both start it with different instances.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decoupling the runner from the Start method has the advantage of making the whole thing easier to test.

I do think it should be called Run not Start, but I agree with the separation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really care and I'm happy to agree for the sake of not talking about it further, but tests should use the private interface, the public interface should reflect the expected external usage of the library in my opinion.

As far as I can tell, there is no advantage to launching the instance in two steps from this perspective. Does having a "runner" allow you to do anything different or more advanced?

// the following are instance artifacts that are created and held as a result
// of launching an osqueryd process
eg *errgroup.Group
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can this be called something more descriptive than eg

@@ -147,7 +147,7 @@ func main() {
defer ext.Shutdown()

// Start the osqueryd instance
instance, err := osquery.LaunchOsqueryInstance(
runner := osquery.NewRunner(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should be LaunchOsqueryInstance for a few reasons.

  • There is no time in the codebase where you start a runner and don't immediately start it.
  • It is not possible to create a runner, pass it to two functions, and have them both start it with different instances.

// updated wholesale without updating the actual OsqueryInstance pointer which
// may be held by the original caller.
type osqueryInstanceFields struct {
type osqueryOptions struct {
// the following are options which may or may not be set by the functional
// options included by the caller of LaunchOsqueryInstance
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this comment with the new name is LaunchOsqueryInstance is decided to not be used.

for _, opt := range opts {
opt(o)
}
r.instanceLock.Lock()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this lock being held if r.instance.eg.Wait() will block forever (if there are no errors)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

r.instance.Wait() will wait until all of the async routines have completed, and should not block indefinitely. We only reached this point after a write was made to r.instance.doneCtx.Done(), which indicates to all the async threads that they should exit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I may be able to move this below line 338 though.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But don't the async routines that are started in the err group loop indefinitely until Shutdown() is called? Which also requires the acquisition of this lock?

I haven't tested this, but as it is now, once this goroutine starts, won't one not be able to call Shutdown(), Restart(), or Healthy() since they all require the acquisition of r.instanceLock

I'm sorry if I'm missing something obvious here, I'm just confused and trying to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is only hit after some error (or call to Shutdown or Restart) causes the r.instance.doneCtx.Done() channel to close (causing the channel read to complete on line 325). Moving the lock below 338 may help avoid any deadlock that could be caused by holding the lock during the wait.

}
go func() {
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this in an infinite for loop if r.instance.eg.Wait() is supposed to block until something goes wrong

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop ensures that we start a new instance each time the existing instance dies, until the shutdown channel is closed (Shutdown was called).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants