Osquery runtime refactoring, bugfixes and stability improvements #176

zwass · 2017-10-13T19:28:20Z

Changes to public API to better reflect actual usage and ease implementation.
Use errgroup for coordination of process management/cleanup. This helps
prevent leaking of goroutines (relative to existing implementation).
Fix bug in which osquery process was not restarted after failure.
Allow logger to be set properly.
Add logging around recovery scenarios.
Check communication with both osquery and extension server in health check
(previously only the extension server was checked).
Add healthcheck on interval that initiates recovery on failure (Closes Implement osqueryd health check #141).
Do not set cmd output to ioutil.Discard. Causes a bug with cmd.Wait (see
os/exec: Inconsistent behaviour in exec.Cmd.Wait golang/go#20730).
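
As a rough illustration of the errgroup-style coordination described above: the PR uses `errgroup.Group` (from golang.org/x/sync/errgroup), but to stay dependency-free this sketch hand-rolls a tiny equivalent. The `group` type, `withContext`, and `runInstance` are hypothetical names, not code from this PR. The idea is that the first goroutine to fail cancels a shared context, so sibling goroutines exit instead of leaking.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// group is a hypothetical, stripped-down stand-in for errgroup.Group:
// the first goroutine to fail cancels the shared context, and Wait
// returns that first error once every goroutine has exited.
type group struct {
	wg     sync.WaitGroup
	once   sync.Once
	err    error
	cancel context.CancelFunc
}

func withContext(parent context.Context) (*group, context.Context) {
	ctx, cancel := context.WithCancel(parent)
	return &group{cancel: cancel}, ctx
}

// Go runs f in its own goroutine and records the first non-nil error.
func (g *group) Go(f func() error) {
	g.wg.Add(1)
	go func() {
		defer g.wg.Done()
		if err := f(); err != nil {
			g.once.Do(func() {
				g.err = err
				g.cancel()
			})
		}
	}()
}

// Wait blocks until all goroutines have returned.
func (g *group) Wait() error {
	g.wg.Wait()
	g.cancel()
	return g.err
}

// runInstance simulates one instance: a "process" goroutine that dies and a
// "monitor" goroutine that only exits when the shared context is cancelled.
func runInstance() (monitorExited bool, err error) {
	g, ctx := withContext(context.Background())
	g.Go(func() error { return errors.New("osqueryd exited") })
	g.Go(func() error {
		<-ctx.Done() // released by the failure above
		monitorExited = true
		return nil
	})
	err = g.Wait()
	return monitorExited, err
}

func main() {
	exited, err := runInstance()
	fmt.Println("monitor exited:", exited, "first error:", err)
}
```

Because `Wait` returns only after every goroutine has finished, a caller that restarts the instance after `Wait` knows nothing from the previous instance is still running.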

groob · 2017-10-13T20:03:18Z

osquery/runtime.go

+
+			// Error case
+			err := r.instance.eg.Wait()
+			level.Error(r.instance.logger).Log(


This is going to be controversial, but let's keep the logs as level.Info with err as a message field.

I thought we were doing:

level.Info(logger).Log(
	"msg", "this is the message about what went wrong",
	"err", err, // this may also be a wrapped error. if so, you can leave off msg.
)

Can I clarify... Are you both suggesting the same thing, or is there some difference?

I think we're suggesting the same thing.

marpaia · 2017-10-13T21:09:13Z

cmd/launcher/launcher.go

@@ -147,7 +147,7 @@ func main() {
 	defer ext.Shutdown()

 	// Start the osqueryd instance
-	instance, err := osquery.LaunchOsqueryInstance(
+	runner := osquery.NewRunner(


I'm interested as to why you broke up LaunchOsqueryInstance into this two step process of creating a runner and explicitly starting it?

It feels a bit more natural to me, and is patterned after https://golang.org/pkg/net/http/#Server. I don't think it has to be this way.

I think it should be LaunchOsqueryInstance for a few reasons.

There is no time in the codebase where you create a runner and don't immediately start it.

It is not possible to create a runner, pass it to two functions, and have them both start it with different instances.

Decoupling the runner from the Start method has the advantage of making the whole thing easier to test.

I do think it should be called Run not Start, but I agree with the separation.

I don't really care, and I'm happy to agree for the sake of not talking about it further. But in my opinion tests should use the private interface, and the public interface should reflect the expected external usage of the library.

As far as I can tell, there is no advantage to launching the instance in two steps from this perspective. Does having a "runner" allow you to do anything different or more advanced?
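
To make the two API shapes in this thread concrete, here is a minimal sketch. All names (`Runner`, `Option`, `WithBinaryPath`, `Start`, `LaunchInstance`) and the simulated behaviour are assumptions for illustration, not the signatures in this PR. It shows that a one-step launch function can be layered on top of a two-step constructor plus start method, so the choice is mostly about which one is exported.

```go
package main

import (
	"errors"
	"fmt"
)

// Runner is a hypothetical stand-in for the type under discussion.
type Runner struct {
	binaryPath string
	started    bool
}

// Option is a functional option, in the style the package already uses.
type Option func(*Runner)

// WithBinaryPath is a made-up option for illustration.
func WithBinaryPath(p string) Option {
	return func(r *Runner) { r.binaryPath = p }
}

// NewRunner only applies configuration; nothing is launched yet.
func NewRunner(opts ...Option) *Runner {
	r := &Runner{binaryPath: "osqueryd"}
	for _, opt := range opts {
		opt(r)
	}
	return r
}

// Start launches the (simulated) instance and rejects a second call.
func (r *Runner) Start() error {
	if r.started {
		return errors.New("runner already started")
	}
	r.started = true
	return nil
}

// LaunchInstance is the one-step form: construct and start together.
func LaunchInstance(opts ...Option) (*Runner, error) {
	r := NewRunner(opts...)
	if err := r.Start(); err != nil {
		return nil, err
	}
	return r, nil
}

func main() {
	// Two-step: the runner can be inspected before anything is started.
	r := NewRunner(WithBinaryPath("/usr/local/bin/osqueryd"))
	fmt.Println("started before Start:", r.started)
	fmt.Println("Start error:", r.Start())

	// One-step: callers never see an unstarted runner.
	r2, err := LaunchInstance()
	fmt.Println("started after LaunchInstance:", r2.started, "error:", err)
}
```

The two-step form gives tests a point at which to inspect or replace configuration before launch; the one-step form keeps an unstarted runner out of callers' hands.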

marpaia · 2017-10-16T15:43:47Z

osquery/runtime.go

 	// the following are instance artifacts that are created and held as a result
 	// of launching an osqueryd process
+	eg                     *errgroup.Group


nit: can this be called something more descriptive than eg?

marpaia · 2017-10-16T15:49:34Z

cmd/launcher/launcher.go

@@ -147,7 +147,7 @@ func main() {
 	defer ext.Shutdown()

 	// Start the osqueryd instance
-	instance, err := osquery.LaunchOsqueryInstance(
+	runner := osquery.NewRunner(


I think it should be LaunchOsqueryInstance for a few reasons.

There is no time in the codebase where you create a runner and don't immediately start it.

It is not possible to create a runner, pass it to two functions, and have them both start it with different instances.

marpaia · 2017-10-16T15:51:42Z

osquery/runtime.go

-// updated wholesale without updating the actual OsqueryInstance pointer which
-// may be held by the original caller.
-type osqueryInstanceFields struct {
+type osqueryOptions struct {
 	// the following are options which may or may not be set by the functional
 	// options included by the caller of LaunchOsqueryInstance


Update this comment with the new name if it is decided that LaunchOsqueryInstance will not be used.

marpaia · 2017-10-16T15:53:59Z

osquery/runtime.go

-	for _, opt := range opts {
-		opt(o)
-	}
+			r.instanceLock.Lock()


why is this lock being held if r.instance.eg.Wait() will block forever (if there are no errors)?

r.instance.Wait() will wait until all of the async routines have completed, and should not block indefinitely. We only reached this point after a write was made to r.instance.doneCtx.Done(), which indicates to all the async threads that they should exit.

I think I may be able to move this below line 338 though.

But don't the async routines that are started in the err group loop indefinitely until Shutdown() is called? Which also requires the acquisition of this lock?

I haven't tested this, but as it stands, once this goroutine starts, won't callers be unable to call Shutdown(), Restart(), or Healthy(), since they all need to acquire r.instanceLock?

I'm sorry if I'm missing something obvious here, I'm just confused and trying to understand.

This line is only hit after some error (or call to Shutdown or Restart) causes the r.instance.doneCtx.Done() channel to close (causing the channel read to complete on line 325). Moving the lock below 338 may help avoid any deadlock that could be caused by holding the lock during the wait.
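
The ordering concern in this thread can be sketched as follows. The `runner` type, its fields, and `supervise` are hypothetical and much simpler than the code under review. The supervising goroutine waits for the stop signal first and takes the lock only afterwards; if it took the lock before the blocking receive, `Shutdown` could never acquire the lock to send the signal.

```go
package main

import (
	"fmt"
	"sync"
)

// runner is a hypothetical, simplified stand-in for the type under review.
type runner struct {
	mu       sync.Mutex
	done     chan struct{} // closed to tell the instance to stop
	cleaned  chan struct{} // closed once cleanup has run
	shutdown bool
}

func newRunner() *runner {
	r := &runner{done: make(chan struct{}), cleaned: make(chan struct{})}
	go r.supervise()
	return r
}

// supervise blocks until the instance is told to stop, and only then
// takes the lock for cleanup. Locking before the receive would block
// Shutdown, which needs mu in order to close done.
func (r *runner) supervise() {
	<-r.done
	r.mu.Lock()
	defer r.mu.Unlock()
	close(r.cleaned)
}

// Shutdown takes the lock to flip state and signal the supervisor, then
// waits for cleanup outside the lock. It is safe to call more than once.
func (r *runner) Shutdown() {
	r.mu.Lock()
	if !r.shutdown {
		r.shutdown = true
		close(r.done)
	}
	r.mu.Unlock()
	<-r.cleaned
}

func main() {
	r := newRunner()
	r.Shutdown()
	fmt.Println("shutdown complete:", r.shutdown)
}
```

If the first two statements of `supervise` were swapped, this program would deadlock in `Shutdown`, which is the scenario the question above describes.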

marpaia · 2017-10-16T15:54:42Z

osquery/runtime.go

 	}
+	go func() {
+		for {


why is this in an infinite for loop if r.instance.eg.Wait() is supposed to block until something goes wrong?

The loop ensures that we start a new instance each time the existing instance dies, until the shutdown channel is closed (Shutdown was called).
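
A minimal sketch of the restart loop described above, assuming a hypothetical `launch` function that blocks until the instance exits and a `shutdown` channel that is closed when shutdown is requested. `superviseLoop` and `demoLaunches` are illustrative names, not code from this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// superviseLoop relaunches the instance every time launch returns, and
// stops once the shutdown channel has been closed. It returns how many
// times the instance was launched.
func superviseLoop(launch func() error, shutdown <-chan struct{}) int {
	launches := 0
	for {
		select {
		case <-shutdown:
			return launches
		default:
		}
		launches++
		_ = launch() // blocks until this instance exits
	}
}

// demoLaunches simulates two crashes followed by a requested shutdown.
func demoLaunches() int {
	shutdown := make(chan struct{})
	attempt := 0
	launch := func() error {
		attempt++
		if attempt == 3 {
			close(shutdown) // simulate Shutdown being called
			return nil
		}
		return errors.New("instance died")
	}
	return superviseLoop(launch, shutdown)
}

func main() {
	fmt.Println("launches:", demoLaunches())
}
```

Each pass through the loop corresponds to one instance lifetime, so the loop body runs once per launch rather than spinning.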

Refactor osquery runtime

a90380b

zwass added zzHistoric:Core Platform zzHistoric:Hardening labels Oct 13, 2017

zwass requested review from marpaia and groob October 13, 2017 19:28

groob reviewed Oct 13, 2017

zwass changed the title ~~Refactor osquery runtime~~ Osquery runtime refactoring, bugfixes and stability improvements Oct 13, 2017

marpaia reviewed Oct 13, 2017

Bit of cleanup

9b98b0a

marpaia reviewed Oct 16, 2017

zwass and others added 4 commits October 16, 2017 12:49

@groob and @marpaia comments

4fb387e

refactor LaunchInstance

851a8dd

add deps from master

1e31a84

Merge branch 'master' of github.com:kolide/launcher into refactor_runtime

e3ea1c4

marpaia approved these changes Oct 17, 2017

update dependencies

49f3c78

marpaia merged commit f7ea653 into kolide:master Oct 17, 2017

zwass mentioned this pull request Oct 24, 2017

launcher stops reporting to the server after a "Unavailable" grpc error #134

Closed
