docs_bulk issue #75

Closed
paulmoto opened this issue Jun 16, 2015 · 24 comments

@paulmoto

When doing uploads with docs_bulk, the call takes a long time, then I get 400 errors and R crashes. I can see that about 90% of my documents get indexed before the crash. Using a PUT with httr works correctly, so the files are formatted correctly. POST behaved the same way as docs_bulk; maybe docs_bulk is using POST instead of PUT? I don't see why the verb should matter, but it apparently does.
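For reference, a minimal sketch of the two code paths being compared here, assuming a local Elasticsearch on localhost:9200 and a file that is already in bulk (NDJSON) format; the file name is a placeholder:

```r
library(elastic)
library(httr)

connect()  # the package default is http://localhost:9200

# upload through the package
docs_bulk("chunk_01.json")

# the same file sent directly with httr; this is the variant that returned
# a response I could inspect
res <- PUT("http://localhost:9200/_bulk",
           body = upload_file("chunk_01.json"))
status_code(res)
```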

@sckott
Contributor

sckott commented Jun 16, 2015

Thanks for your message @paulmoto. Looking at this now; I don't think PUT vs. POST should be a problem, but I'll check.

Where is your Elasticsearch instance running?

How big is the data you're putting in?

@paulmoto
Author

Running ES 1.5.2. I've got about 3 GB total, but I've broken it into ~10 MB chunks. Digging deeper, I see character encoding issues in the files that do upload without crashing; maybe this is related? Sending ü gives me
"MapperParsingException[failed to parse [concept]]; nested: JsonParseException[Invalid UTF-8 start byte 0x96\n at [Source: [B@4601be2d; line: 1, column: 26]]; "

even though it's coming from a UTF-8 encoded source.
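A quick way to find which lines carry bytes that are not valid UTF-8 before uploading (a sketch; the file name is a placeholder, and validUTF8() needs R >= 3.3.0):

```r
lines <- readLines("chunk_01.json", warn = FALSE)
bad <- which(!validUTF8(lines))  # indices of lines with invalid UTF-8 byte sequences
lines[bad]
```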

@paulmoto
Author

The character encoding problems are probably what make it take so long, and I can (hopefully) sort the encoding out myself. The real issue here is that R crashes when the response finally comes back.

@sckott
Contributor

sckott commented Jun 16, 2015

@paulmoto regarding the encoding error: can you try uploading via curl on the command line and see if you get the same error? I'm trying to figure out whether this is a problem with the elastic package or not.

@sckott
Contributor

sckott commented Jun 16, 2015

Weird that R crashes. I haven't had that problem.

@paulmoto
Author

Using curl I get my response faster, and the character encoding issues are still there. For the files that crashed R, the response is gigantic: a giant list of 2-3 digit numbers followed by the standard response saying which documents were created. I imagine the giant list of numbers filled my memory and crashed R.

@sckott
Contributor

sckott commented Jun 16, 2015

Oh, I wonder if the output from docs_bulk() printing to the R console is what you're talking about, and whether that's the problem here: overflowing what the R console can handle, or something like that.

@sckott
Contributor

sckott commented Jun 16, 2015

Are you loading in via a data.frame or list in R, or via a file?
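For context, docs_bulk() accepts a data.frame, a list of documents, or a path to a file that is already in bulk format. A minimal sketch of the three forms; the index and type names are placeholders:

```r
library(elastic)
connect()

docs_bulk(mtcars, index = "cars", type = "records")                      # data.frame
docs_bulk(apply(mtcars, 1, as.list), index = "cars", type = "records")   # list of documents
docs_bulk("chunk_01.json")                                               # file already in bulk format
```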

@paulmoto
Author

I'm using .txt files. The character encoding issue was due to R not saving the .txt as UTF-8 encoded, so not a big deal. I'm looking into what is causing this extremely long response.

@sckott
Contributor

sckott commented Jun 16, 2015

Hmm, we don't rewrite your file if you pass in a file path (see https://github.com/ropensci/elastic/blob/master/R/docs_bulk.r#L98-L112), so we're not changing the encoding.

@paulmoto
Author

The file was generated from another Elasticsearch query, which I parsed and wrote to a text file using write() instead of writeLines(); that caused the character issue. This doesn't look like a problem with docs_bulk. There must be something in my text files causing the extremely long response, which is the real problem here.
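For anyone hitting the same thing, writing through a connection opened with an explicit encoding avoids this. A sketch, assuming `lines` holds the bulk-format lines and the file names are placeholders:

```r
# write new output as UTF-8
con <- file("chunk_01.json", open = "w", encoding = "UTF-8")
writeLines(lines, con)   # the connection re-encodes the strings to UTF-8 on write
close(con)

# or convert a file that was already written in latin1/CP1252
txt <- readLines("chunk_01.json", warn = FALSE)
writeLines(iconv(txt, from = "latin1", to = "UTF-8"),
           "chunk_01_utf8.json", useBytes = TRUE)
```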

@paulmoto
Author

The only difference between using PUT in httr and docs_bulk is that if I use PUT, I can view my response without crashing.

@sckott
Contributor

sckott commented Jun 16, 2015

Hmm, okay, I'll keep looking into that, though the documentation I've seen always uses POST.

Have you seen the docs for the bulk API (https://www.elastic.co/guide/en/elasticsearch/guide/current/bulk.html), especially the section at the bottom on the size of the data? Have you played around with the size of each chunk of documents? E.g., in the docs_bulk() function you can change the chunk size with the chunk_size parameter; perhaps your chunk size is too big?
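E.g., a sketch with an arbitrary smaller chunk size; the data, index, and type names are placeholders, and chunk_size applies when the input is a data.frame or list rather than a pre-built bulk file:

```r
docs_bulk(mydata, index = "myindex", type = "mytype", chunk_size = 500)
```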

@paulmoto
Author

I figured out my issue. In jsonlite, the toJSON function drops the names of named vectors (I had to switch to using lists). This was happening for around 10% of my documents, generating lots of parsing errors, which is what made the response so long. I was able to parse the response I got out of PUT to find the errors, whereas with docs_bulk R crashed when I tried to view the response.
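A minimal illustration of the jsonlite behaviour described here:

```r
library(jsonlite)

x <- c(concept = "foo", score = "1")   # named character vector
toJSON(x)                              # ["foo","1"]  -- the names are dropped
toJSON(as.list(x))                     # {"concept":["foo"],"score":["1"]}  -- a list keeps them
toJSON(as.list(x), auto_unbox = TRUE)  # {"concept":"foo","score":"1"}
```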

@sckott
Contributor

sckott commented Jun 17, 2015

Can you share one of the documents that produced a problem, so I can see if I can fix it in the package?

@paulmoto
Author

Unfortunately, I can't share my data because it contains user-specific information. The files were about 10 MB with about 5000-7000 documents in them (approx 10% of documents didn't match my mapping). I would imagine that trying to upload any data that doesn't match the mapping would do the same thing.

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto okay, understood about the personal data.

So when you changed to inputting R lists, docs_bulk() worked for you? Or were there still problems?

@paulmoto
Author

Yes, when everything is correct, docs_bulk() works fine.

sckott added a commit that referenced this issue Jun 17, 2015
bump 99 ver and remove date from description file
@sckott
Contributor

sckott commented Jun 17, 2015

Can you try reinstalling elastic from GitHub, then try docs_bulk() again with your data?

There is a hidden parameter in jsonlite::toJSON that lets us convert named vectors to lists, though the maintainer says he will drop it soon (jeroen/jsonlite#82), so I don't want to rely on that.

Instead, I'm just checking for a vector and converting it into a list with as.list() before we get to jsonlite::toJSON.
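Something along these lines; a sketch, not the package's exact code, and the helper name is made up:

```r
as_list_if_named_vector <- function(x) {
  if (is.vector(x) && !is.list(x) && !is.null(names(x))) as.list(x) else x
}

as_list_if_named_vector(c(a = 1, b = 2))  # becomes list(a = 1, b = 2)
as_list_if_named_vector(list(a = 1))      # returned unchanged
```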

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto forgot to mention you; let me know if that works.

@paulmoto
Author

I wasn't using docs_bulk to upload an R object, so I doubt this would affect me at all. I had written my own (bad) code to write the text files. If you've made changes that would affect me, I can test this tomorrow.

@sckott
Contributor

sckott commented Jun 17, 2015

@paulmoto okay, so then I don't understand how you're using docs_bulk() - are you passing in file paths to the function?

@paulmoto
Author

Yes. The toJSON vector issue was why I had incorrectly formatted files, which I then passed to docs_bulk. The long response Elasticsearch generated because of those incorrectly formatted files somehow caused R to crash when I went through docs_bulk. Passing the same file with a PUT received (presumably) the same response and did not crash.

@sckott sckott self-assigned this Jun 29, 2015
@sckott
Contributor

sckott commented Jul 1, 2015

@paulmoto Can you update to httr v1 (now on CRAN) and let me know if you still get the same problem? Just checking to see if the new httr fixes this.

@sckott sckott closed this as completed Aug 15, 2015