Contributor

@qingqing01 qingqing01 commented Dec 25, 2017

Fix #7001

return word_idx


def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
Contributor

buffer_size is now unused.

Contributor

buffer_size is never used. Even in the previous experiment, people only set the shuffle buffer.

Contributor Author

@qingqing01 qingqing01 Dec 26, 2017

Done, I removed buffer_size. I also timed data loading with and without two threads:

  • Without two threads: 16.65757s
  • With two threads: 25-27s. I'm not sure why this is slower. The code is as follows:
import random
import threading
import time


def reader_creator(pos_pattern, neg_pattern, word_idx, buffer_size):
    start_time = time.time()
    UNK = word_idx['<unk>']

    POS = []
    NEG = []

    def load(pattern, out, label):
        for doc in tokenize(pattern):
            out.append(([word_idx.get(w, UNK) for w in doc], label))

    # Create two threads that load the positive and negative samples
    # into the POS and NEG lists.
    t0 = threading.Thread(
        target=load, args=(
            pos_pattern,
            POS, 0, ))
    t0.daemon = True
    t0.start()

    t1 = threading.Thread(
        target=load, args=(
            neg_pattern,
            NEG, 1, ))
    t1.daemon = True
    t1.start()

    t0.join()
    t1.join()

    INS = POS + NEG
    random.shuffle(INS)
    duration = time.time() - start_time
    print('\nTotal time: %.5f ' % (duration))

    def reader():
        for doc, label in INS:
            yield doc, label

    return reader
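
For comparison, the single-threaded variant behind the first timing might look roughly like the sketch below. One plausible explanation for the slowdown, not verified here, is that tokenizing and indexing are CPU-bound pure-Python work, so under CPython's global interpreter lock the two threads cannot run in parallel and only add switching overhead. The tokenize function in this sketch is a hypothetical stand-in for the module's real tokenizer, and the tiny corpus is made up for illustration.

```python
import random


def tokenize(pattern):
    # Hypothetical stand-in for the dataset module's tokenize(): the real
    # one reads documents matching `pattern` from the IMDB archive.
    corpus = {
        'pos': [['good', 'film'], ['great']],
        'neg': [['bad', 'plot']],
    }
    for doc in corpus[pattern]:
        yield doc


def reader_creator(pos_pattern, neg_pattern, word_idx):
    UNK = word_idx['<unk>']
    INS = []

    def load(pattern, out, label):
        # Map every word to its index, falling back to UNK.
        for doc in tokenize(pattern):
            out.append(([word_idx.get(w, UNK) for w in doc], label))

    # Load both classes sequentially in the calling thread.
    load(pos_pattern, INS, 0)
    load(neg_pattern, INS, 1)
    random.shuffle(INS)

    def reader():
        for doc, label in INS:
            yield doc, label

    return reader
```

If parallel loading is still wanted, separate processes rather than threads would be the usual way to sidestep the lock, at the cost of pickling the results back to the parent.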

Contributor

@dzhwinter dzhwinter left a comment

Great enhancement.

Collaborator

@reyoung reyoung left a comment

Excellent job. Thanks!

@qingqing01 qingqing01 merged commit c3fd2c2 into PaddlePaddle:develop Dec 26, 2017
@qingqing01 qingqing01 deleted the imdb_data branch November 14, 2019 05:26