implement file caching #11
So we don't have to download files multiple times, we want to cache them and compare MD5 checksums to check whether a file is already in the cache.
fixes: https://gitea.opensuse.org/importers/git-importer/issues/2
@ -165,0 +166,4 @@
if not self.in_cache(name, dirpath, file_md5):
with (dirpath / name).open("wb") as f:
f.write(self._download(project, package, name, revision).read())
shutil.copy(dirpath / name, cached_file)
Try first a hardlink to save space and speed up the cache:
You can use the same approach later
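A minimal sketch of that idea, assuming the names from the diff above (cached_file, self._download); os.link fails across filesystems, so a copy fallback is kept:

import os
import shutil

# Illustrative sketch only; method and argument names mirror the diff
# above but are not the actual implementation.
def fetch(self, project, package, name, revision, dirpath, cached_file):
    target = dirpath / name
    if cached_file.is_file():
        try:
            os.link(cached_file, target)  # instant, shares the inode
        except OSError:
            shutil.copy(cached_file, target)  # e.g. cache on another filesystem
    else:
        with target.open("wb") as f:
            f.write(self._download(project, package, name, revision).read())
        shutil.copy(target, cached_file)  # populate the cache for next time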
Uhm, there is the risk of changing the cached version after the next update. Maybe it is better to unlink the file first if present (IIRC there is a parameter that suppresses the error if the file does not exist, removing the need to call .exists() before the unlink)
So we want to unlink the normal file (not the file in cache) every time at the start, and then download or link again depending on whether it is already in cache?
I think so, right? If we unlink, we keep the cached version (copy or link) without the risk of changing its content. What do you think?
I agree it's the better solution, I just wanted to make sure that I understand your logic correctly
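The parameter Alberto is thinking of is missing_ok on Path.unlink (Python 3.8+), which makes the unlink-first step a one-liner:

target = dirpath / name
# Drop the working-copy file unconditionally before linking or
# downloading; missing_ok=True suppresses FileNotFoundError,
# so no .exists() check is needed (Python 3.8+).
target.unlink(missing_ok=True)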
If you are afraid of files getting corrupted, why hardlink at all?
Do we know how git handles files in its working copy when you e.g. checkout a different branch?
Can copying 1.3M files be a bit expensive? (I did not measure it, but my crappy HD is killing me every morning)
When possible, hardlinks: https://git-scm.com/docs/git-clone/
It only talks about objects cloned from a local repository. And 1.3M files is for all the repos - and right now I'm downloading them again and again from OBS, which is actually worse :)
But I asked @nkrapp to check how git checkout treats files in the WC and then we can decide.
The git documentation is very ambiguous here and doesn't say outright how checkout treats files, but they used the wording
updating the index and the files in the working tree
(https://git-scm.com/docs/git-checkout/en) so I'd expect it to actually edit the files
Can be fun to check if this is true: create a repo, commit a file, create a branch, update the file in the branch, and create a hard link outside of the repo. If a checkout into master also updates the external link, we have only one option (do the copy and forget).
Just tested it and changing a file in a branch also updated the hardlinked file outside of the repo. I guess I'll leave the code as it was then
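For reference, a rough Python re-creation of that experiment (assumes git 2.28+ on PATH; all paths and messages are illustrative):

import os
import subprocess
import tempfile
from pathlib import Path

repo = Path(tempfile.mkdtemp()) / "repo"
repo.mkdir()

def run(*cmd):
    subprocess.run(cmd, cwd=repo, check=True)

run("git", "init", "-b", "master")
run("git", "config", "user.email", "test@example.com")
run("git", "config", "user.name", "test")
(repo / "f.txt").write_text("master\n")
run("git", "add", "f.txt")
run("git", "commit", "-m", "initial")
run("git", "checkout", "-b", "feature")
(repo / "f.txt").write_text("feature\n")
run("git", "commit", "-am", "update")

# Hardlink the working-tree file to a path outside the repo.
link = repo.parent / "external-link"
os.link(repo / "f.txt", link)

run("git", "checkout", "master")
# "master" here means checkout wrote through the shared inode (a
# hardlinked cache would be corrupted); "feature" means git replaced
# the file and the external link kept the old content.
print(link.read_text())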
@ -180,2 +187,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
as explained later, maybe
dirpath
can be removed
@ -181,1 +188,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
cache = dirpath.joinpath(".cache/")
Is there a risk of committing this into git? Maybe use
cache = Path("~/.cache/git-import/").expanduser()
Hmm, I would prefer this to be on the same file system as the repos folder to avoid the copying. And the repos folder is a subdir - but without a dot. So make it a command line parameter so I can put it in $PWD. But as a default, what Alberto suggested sounds best.
So the path should be
~/git-importer/repos/.cache
?
That would work for me personally, but I'd rather go with the XDG default Alberto outlined and make it configurable with a command line option.
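A sketch of that compromise; the option name and the use of argparse are assumptions, not the actual git-importer CLI:

import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
# XDG-style default as Alberto suggested; --cache-dir lets you put the
# cache on the same filesystem as the repos folder (e.g. $PWD) so
# hardlinks work and no copying is needed.
parser.add_argument(
    "--cache-dir",
    type=Path,
    default=Path("~/.cache/git-importer").expanduser(),
)
args = parser.parse_args()
cache = args.cache_dir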
@ -182,0 +191,4 @@
cache = dirpath.joinpath(".cache/")
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
I propose to use the same model than git (also drop joinpath):
filepath = cache / md5[:2] / md5[2:]
As a datapoint: we're talking about 1.3M files - that would still be ~5000 files per directory. The git objects are packed for a reason - there are not supposed to be 1.3M of them.
This of course implies we keep the .cache independent of the repo path - having it inside the repo is, I agree, a bad idea.
And the .cache is actually committed: https://gitea.opensuse.org/coolo/rpm/src/branch/factory/.cache
In that case
filepath = cache / md5[:3] / md5[3:]
will create 4096 entries in the first level and ~300 files in each one.
That sounds like a good compromise.
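Putting the thread together, a sketch of _path_from_md5 with the 3-hex fan-out (16**3 = 4096 buckets, ~1.3M / 4096 ≈ 320 files each); mkdir(parents=True, exist_ok=True) also covers the directory creation discussed below:

from pathlib import Path

def _path_from_md5(self, name, cachedir: Path, md5: str) -> Path:
    # 3 hex chars of fan-out: 16**3 = 4096 first-level buckets,
    # ~320 files per bucket for 1.3M cached files.
    bucket = cachedir / md5[:3]
    # parents/exist_ok make a separate .exists()/mkdir dance redundant.
    bucket.mkdir(parents=True, exist_ok=True)
    return bucket / f"{md5[3:]}-{name}"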
@ -182,0 +192,4 @@
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
filepath.mkdir(parents=True, exist_ok=True)
you create the directories here, so you can skip lines 192 and 193
@ -182,0 +196,4 @@
return filepath.joinpath(f"{md5[9:]}-{name}")
def in_cache(self, name, dirpath, md5):
if self._path_from_md5(name, dirpath, md5).is_file():
return self._path_from_md5(name, dirpath, md5).exists()
Please move the md5 validation from git_exporter.py into the part that does the actual download or copying. Validating a hardlink won't be necessary
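One way to read that request, as a hedged sketch (_download_to is a hypothetical helper, not existing code): hash the bytes where they arrive, so a later cache hit is trusted without re-validation:

import hashlib

def _download_to(self, project, package, name, revision, target, expected_md5):
    # Validate at download time instead of in git_exporter.py; a file
    # that reaches the cache has already been checked once.
    data = self._download(project, package, name, revision).read()
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise RuntimeError(f"MD5 mismatch for {name} in {package}")
    target.write_bytes(data)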
@nkrapp seems that you introduced some extra commits? Can you rebase?
I tried to resolve the merge conflicts but it seems I just made it more complicated
There are many ways to fix this issue. For now try this one:
Very carefully, drop the 4 commits that you see here that are from coolo
There should not be any conflict now
cd30c3381f to 639096b548
One nit in comments, but LGTM
@ -182,0 +196,4 @@
def _path_from_md5(self, name, cachedir, md5):
filepath = cachedir / md5[:3]
cached_file = f"{md5[3:]}-{name}"
I liked it more when it did not contain the {name}. That makes it a pure object access. Now if two files have the same content but different names (LICENSE / license, README / README.txt), it will create two cache entries.
I added this mainly so I could differentiate between the files, but I know this isn't really needed outside of tests, so I don't mind taking it out.
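For reference, the name-free variant Alberto describes would make the cache purely content-addressed, so identical LICENSE/license files share one entry (sketch only):

def _path_from_md5(self, cachedir, md5):
    # Key on content alone: files with equal content but different
    # names (LICENSE / license) map to the same cache entry.
    return cachedir / md5[:3] / md5[3:]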