This repository was archived by the owner on Feb 12, 2022. It is now read-only.

Caching files #36

Open

wants to merge 32 commits into master

Conversation

@mjsobrep commented Sep 1, 2019

Issue: #19

Description of changes:
Added support for caching files by saving them to disk and tracking them in an SQLite db. The total cache size is tracked, and when the maximum is exceeded the least recently used file is deleted.
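A minimal sketch of that eviction loop, assuming the PR's `cache` table with `file`, `last_accessed`, and `size` columns (names illustrative, not a drop-in patch):

```python
import os
import sqlite3

def evict_lru(conn, max_cache_bytes):
    """Delete least-recently-used cache entries until the total size fits.

    Assumes a `cache` table with `file`, `last_accessed`, and `size`
    columns, roughly as in this PR.
    """
    total = conn.execute('SELECT COALESCE(SUM(size), 0) FROM cache').fetchone()[0]
    while total > max_cache_bytes:
        row = conn.execute(
            'SELECT file, size FROM cache ORDER BY last_accessed LIMIT 1'
        ).fetchone()
        if row is None:
            break
        path, size = row
        if os.path.exists(path):
            os.remove(path)  # drop the audio file itself
        conn.execute('DELETE FROM cache WHERE file = ?', (path,))
        total -= size
    conn.commit()
```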

TODO

  • Expose the cache size to the launch system
  • Allow the cache location to be set in the launch system
  • Test more widely (only tested with the simple action server example)
  • Document all of this for general use
  • Add unit tests for this functionality

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@codecov bot commented Sep 1, 2019

Codecov Report

Merging #36 (ee320dd) into master (c8ebbf7) will increase coverage by 4.96%.
The diff coverage is 94.68%.

❗ Current head ee320dd differs from pull request most recent head b3829ee. Consider uploading reports for the commit b3829ee to get more accurate results


@@            Coverage Diff             @@
##           master      #36      +/-   ##
==========================================
+ Coverage   77.45%   82.41%   +4.96%     
==========================================
  Files           3        4       +1     
  Lines         275      364      +89     
==========================================
+ Hits          213      300      +87     
- Misses         62       64       +2     
Flag Coverage Δ
ROS_1 82.41% <94.68%> (+4.96%) ⬆️
kinetic 82.41% <94.68%> (+4.96%) ⬆️
melodic 82.41% <94.68%> (+4.96%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
tts/src/tts/db.py 86.84% <86.84%> (ø)
tts/src/tts/synthesizer.py 92.53% <100.00%> (+8.19%) ⬆️
tts/src/tts/__init__.py 96.42% <0.00%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dfd3802...b3829ee.

@mjsobrep (Author) commented Sep 1, 2019

I can't get the code coverage up because of #37. When those tests do start running, it will probably be a good idea to append random strings to any test text that needs to reach the server; otherwise repeated test runs will just hit the cache. There also need to be some tests that intentionally hit the cache.

@mjsobrep (Author) commented Sep 9, 2019

@AAlon is this something that you all are interested in? Should I keep working on it?

@AAlon (Contributor) commented Sep 13, 2019

@AAlon is this something that you all are interested in? Should I keep working on it?

Definitely! Thank you for contributing. We'll be reviewing this soon.

@cevans87 (Contributor) left a comment

Thanks for submitting this. I'm a fan of the approach. Just a few things that I think could show up as bugs and some suggestions.

I'll get back to you on how we can deal with test coverage after I understand #37 a bit better.

@mjsobrep (Author) commented Nov 9, 2019

@cevans87 Thanks for the feedback, and sorry for being slow to get back to it. I have gotten rid of the size table per your recommendation and pulled some logic into the db class to improve readability. I have tested both normal operation and deleting cached files from the disk without telling the db.

@mjsobrep (Author) commented

I just found a bug with this: if there is no network connection, the error utterance ('the polly service cannot be reached ...') gets cached as if it were the requested text. The system will then always play that error audio whenever that text is requested.

@AAlon (Contributor) commented Jan 6, 2020

I just found a bug with this: if there is no network connection, the error utterance ('the polly service cannot be reached ...') gets cached as if it were the requested text. The system will then always play that error audio whenever that text is requested.

Are you actively working on that? If so, we should probably wait before re-reviewing. It'd be great if you could add some unit tests too. Thanks!

@mjsobrep (Author) commented

@AAlon I took a look at it a while back but haven't had time to come up with a good fix. The way errors are handled is a little odd: instead of propagating up, they are handled by passing a different sound file. Fixing this will require the errors to propagate, which I actually think would be a good thing. In many situations the current error handling would be poor for human-robot interaction; allowing integrators (like me) to customize how errors are handled would be better. I should get to it by the middle of February at the latest; I'm pretty busy until then.

I'm not sure how to add unit tests, ref #37

@mjsobrep (Author) commented

@cevans87 @AAlon I fixed the caching of the error file and added a bunch of new tests to get the code coverage up to a reasonable level. I think this is ready for a new review.

@mjsobrep (Author) commented

Want to check back in on this.

@mm318 (Contributor) left a comment

Thank you very much for the significant contribution!

I have some comments and questions.

Comment on lines +90 to +153
class DummyEngine:
    """A dummy engine which exists to facilitate testing. Can either
    be set to act as if it is connected or disconnected. Will create files where
    they are expected, but they will not be actual audio files."""

    def __init__(self):
        self.connected = True
        self.file_size = 50000

    def __call__(self, **kwargs):
        """Put a file at the specified location and return reasonable dummy
        values. If not connected, fills in the Exception fields.

        Args:
            **kwargs: dictionary with fields: output_format, voice_id, sample_rate,
                text_type, text, output_path

        Returns: A json version of a string with fields: Audio File, Audio Type,
            Exception (if there is an exception), Traceback (if there is an exception),
            and if successful Amazon Polly Response Metadata
        """
        if self.connected:
            with open(kwargs['output_path'], 'wb') as f:
                f.write(os.urandom(self.file_size))
            output_format = kwargs['OutputFormat'] if 'OutputFormat' in kwargs else 'ogg_vorbis'
            resp = json.dumps({
                'Audio File': kwargs['output_path'],
                'Audio Type': output_format,
                'Amazon Polly Response Metadata': {'some header': 'some data'}
            })
            return SynthesizerResponse(resp)
        else:
            current_dir = os.path.dirname(os.path.abspath(__file__))
            error_ogg_filename = 'connerror.ogg'
            error_details = {
                'Audio File': os.path.join(current_dir, '../src/tts/data', error_ogg_filename),
                'Audio Type': 'ogg',
                'Exception': {
                    'dummy head': 'dummy val'
                    # 'Type': str(exc_type),
                    # 'Module': exc_type.__module__,
                    # 'Name': exc_type.__name__,
                    # 'Value': str(e),
                },
                'Traceback': 'some traceback'
            }
            return SynthesizerResponse(json.dumps(error_details))

    def set_connection(self, connected):
        """Set the connection state.

        Args:
            connected: boolean, whether to act connected or not
        """
        self.connected = connected

    def set_file_sizes(self, size):
        """Set the target file size for future files in bytes.

        Args:
            size: the number of bytes to make the next files
        """
        self.file_size = size
Contributor

I'm kind of on the fence about this class being here instead of being under test, such as in test_unit_synthesizer.py.

Contributor

Yep, this belongs under test/

        import uuid

        db = DB()
        init_num_files = db.get_num_files()
Contributor

Should you have an assert that init_num_files == 0? So later on, your test should be db.get_num_files() == 1 instead of db.get_num_files() == init_num_files + 1.

Contributor

This may need a deletion of /tmp/polly.db at the end of each test case.

What is the right way to delete a database and all the audio files associated with it?
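One possible cleanup hook, assuming the default `/tmp/polly.db` location used in this PR (the class and constant names here are illustrative, and audio files would need similar tracking):

```python
import os
import tempfile
import unittest

# Default db location in this PR; audio files would need similar bookkeeping.
DB_PATH = os.path.join(tempfile.gettempdir(), 'polly.db')

class CacheTestCase(unittest.TestCase):
    def tearDown(self):
        # Remove the database so each test case starts from an empty cache.
        if os.path.exists(DB_PATH):
            os.remove(DB_PATH)
```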

        self.assertEqual(db.get_num_files(), init_num_files + 1)

    def test_multiple_novel(self):
Contributor

It's probably better named test_multiple_unique.

        self.assertEqual(db.get_num_files(), init_num_files)

    def test_no_connection_existing(self):
        from tts.db import DB
Contributor

Unused import.

        import json
        import os

        db = DB()
Contributor

Unused variable.

Contributor

Though, it would be better to have some tests for db.get_num_files().

        self.assertEqual(audio_file1, audio_file2)
        self.assertTrue(os.path.exists(audio_file2))

    def test_no_connection_novel(self):
Contributor

test_no_connection_nonexisting?


        self.assertFalse(os.path.exists(audio_file1))

    def test_big_db(self):
Contributor

I think test_db_size_limit would be a better name.

@mm318 mm318 requested review from AAlon and removed request for timrobotson March 23, 2020 09:51
@AAlon (Contributor) left a comment

Thanks again for taking the time and effort to iterate on this @mjsobrep! Most of my comments are very minor, but I do have one concern around performance and SQLite that I'd like to see addressed before finalizing this implementation.


        # need to look at the hash itself.
        db = DB()
        db_search_result = db.ex(
            'SELECT file, audio_type FROM cache WHERE hash=?', tmp_filename).fetchone()
@AAlon (Contributor) commented Apr 2, 2020

The synthesizer shouldn't need to know SQL: ex should be private, and the DB class should abstract queries away via methods like db.find_record(key), db.update_last_accessed(key), and db.add(record). Maybe also rename DB to SQLiteBackedCache or something?
Alternatively, keep DB as is and create a new SQLiteBackedCache that abstracts the database access.


        :param kw: what AmazonPolly needs to synthesize
        :return: response from AmazonPolly
        """
        if 'output_path' not in kw:
            tmp_filename = hashlib.md5(kw['text']).hexdigest()
            tmp_filepath = os.path.join(os.sep, 'tmp', 'voice_{}_{}'.format(tmp_filename, str(time.time())))
            tmp_filename = hashlib.md5(
Contributor

Needs a better name; filename_hash maybe.

        not given, the utterance is added to the cache. If a
        filename is specified, then we will assume that the
        file is being managed by the user and it will not
        be added to the cache.

        :param kw: what AmazonPolly needs to synthesize
        :return: response from AmazonPolly
        """
        if 'output_path' not in kw:
Contributor

Would be nice to break this function into 2-3 smaller ones (perhaps at the if/else blocks).

            tmp_filepath = os.path.join(os.sep, 'tmp', 'voice_{}_{}'.format(tmp_filename, str(time.time())))
            tmp_filename = hashlib.md5(
                json.dumps(kw, sort_keys=True)).hexdigest()
            tmp_filepath = os.path.join(
Contributor

I know it was like this before but the prefix tmp_ doesn't make much sense to me, especially now that we're caching things.

            tmp_filename = hashlib.md5(
                json.dumps(kw, sort_keys=True)).hexdigest()
            tmp_filepath = os.path.join(
                os.sep, 'tmp', 'voice_{}'.format(tmp_filename))
Contributor

Use tempfile.gettempdir() instead of hardcoding.
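i.e., something along these lines (`filename_hash` is a placeholder for the md5 digest):

```python
import os
import tempfile

# tempfile.gettempdir() respects TMPDIR and works on non-Unix platforms,
# unlike the hardcoded os.path.join(os.sep, 'tmp', ...).
filename_hash = 'abc123'  # placeholder for the md5 digest
filepath = os.path.join(tempfile.gettempdir(), 'voice_{}'.format(filename_hash))
```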

Comment on lines +241 to +247
            while db.get_size() > self.max_cache_bytes and db.get_num_files() > 1:
                remove_res = db.ex(
                    'select file, min(last_accessed), size from cache'
                ).fetchone()
                db.remove_file(remove_res['file'])
                rospy.loginfo('removing %s to maintain cache size, new size: %i',
                              remove_res['file'], db.get_size())
Contributor

This logic should be part of the DB class' add method (or some intermediary Cache class).

Comment on lines +210 to +240
        # TODO: add a test that deletes a file without telling the db and tries to synthesize it
        if os.path.exists(db_search_result['file']):
            file_found = True
            db.ex('update cache set last_accessed=? where hash=?',
                  current_time, tmp_filename)
            synth_result = PollyResponse(json.dumps({
                'Audio File': db_search_result['file'],
                'Audio Type': db_search_result['audio_type'],
                'Amazon Polly Response Metadata': ''
            }))
            rospy.loginfo('audio file was already cached at: %s',
                          db_search_result['file'])
        else:
            rospy.logwarn(
                'A file in the database did not exist on the disk, removing from db')
            db.remove_file(db_search_result['file'])
        if not file_found:  # haven't cached this yet
            rospy.loginfo('Caching file')
            synth_result = self.engine(**kw)
            res_dict = json.loads(synth_result.result)
            if 'Exception' not in res_dict:
                file_name = res_dict['Audio File']
                if file_name:
                    file_size = os.path.getsize(file_name)
                    db.ex('''insert into cache(
                        hash, file, audio_type, last_accessed, size)
                        values (?,?,?,?,?)''', tmp_filename, file_name,
                          res_dict['Audio Type'], current_time, file_size)
                    rospy.loginfo(
                        'generated new file, saved to %s and cached', file_name)
        # make sure the cache hasn't grown too big
Contributor

Performance concerns

  • Potentially doing many more transactions than needed; only need to do at most one transaction per _call_engine call. For example, if the new file caused us to go beyond the size limit by 1000 bytes, and we need to remove the 10 oldest files in order to clear that much space, we'd be doing more than 10 transactions.
  • Everything seems to be synchronous, so even if we only have one transaction (say, to insert a new entry), returning synth_result is still blocked on completion of that transaction. Using WAL and performing checkpointing in a separate thread is one way to overcome that.

Performance Impact discussion

In part, the value of this node is in enabling Human-Robot interaction via Polly, and that heavily depends on the latency; if I'm asking the robot a question and it answers 5 seconds later, that's not very useful.

We know that with this implementation there will be increased latency; the question is how much. Would it add 1 ms or 500 ms? Maybe it's even less than 1 ms; I really don't know. Do you?
It would feel a bit premature to me to merge this without doing a simple benchmark in which we mock out line 228 (the self.engine() call) and measure the added latency (in lieu of any other data).
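A crude stand-in for that benchmark might look like the following (in-memory SQLite, engine call omitted entirely; numbers will vary by machine and by the on-disk configuration):

```python
import sqlite3
import time

# Simulate only the cache-hit path: one SELECT plus one last_accessed UPDATE.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE cache (hash text PRIMARY KEY, file text, last_accessed real)')
conn.execute("INSERT INTO cache VALUES ('deadbeef', '/tmp/voice_deadbeef', 0.0)")
conn.commit()

n = 1000
start = time.time()
for _ in range(n):
    conn.execute('SELECT file FROM cache WHERE hash=?', ('deadbeef',)).fetchone()
    conn.execute('UPDATE cache SET last_accessed=? WHERE hash=?',
                 (time.time(), 'deadbeef'))
conn.commit()
per_call_ms = (time.time() - start) * 1000.0 / n
print('avg cache-hit overhead: %.3f ms' % per_call_ms)
```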

Path forward

Generally, it seems like we're reinventing the wheel here by using SQLite directly. There are dozens of libraries that implement all the convenience functionality we'd ever need (see DiskCache), as well as performance/transaction optimizations. So to summarize:

  • Why not use some of the battle-tested disk-backed caches out there? Did you choose SQLite because of lack of integration with ROS (i.e. not available via rosdep)?
  • If we have to continue using SQLite directly, we need to test the effect on latency. Again, maybe there's no issue at all, but we need to make sure.

@mjsobrep (Author)

Sorry I have been so slow to respond. I'm going to start here before going anywhere else, since the answers on this thread may change what is being done elsewhere.

I totally agree that latency is a major issue in these types of systems. SQLite is fast and I would imagine that it is at least an order of magnitude faster than making a web call that downloads a file, probably multiple orders of magnitude better. But you are correct, I don't know that for a fact.

I chose SQLite because it is built into python, it is stable, and it is well known/documented. I could certainly see another tool being used, but I don't know which other tools would check those boxes. DiskCache that you link to looks great. Is that a dependency you all are ok with?

@mjsobrep (Author)

Hey @AAlon, want to check back in on this. If you would like me to refactor this to DiskCache or make some timing tests for the current implementation, I can do either of those. DiskCache would make for cleaner, more maintainable code, but I'm not sure if that is a dependency that y'all want in here. Just let me know; I'm hoping to get some code done in a few places over the next few weeks.

Contributor

I think that's fine, but we should probably take a look at the benchmarking mentioned previously.

@mjsobrep (Author)

what is "that" in "that's fine"?

Contributor

Sorry, I'll be specific. @mjsobrep, if you would like to implement some timing tests to provide a ground truth and test your changes here (along with possibly refactoring to DiskCache), I think that's a good start.

I'm not sure if you've seen it, but the DiskCache timing info seems promising.

class DB(object):
    """DB: a class to manage the database for tracking cached files"""

    def __init__(self, db_location='/tmp/polly.db'):
Contributor

tempfile.gettempdir() instead of /tmp?

    def make_db(self):
        self.ex('''CREATE TABLE IF NOT EXISTS cache (
            hash text PRIMARY KEY,
            file text NOT NULL,
Contributor

file is actually a filepath, correct?

5 participants