implement file caching #11

Merged
Ghost merged 3 commits from file-cache into main 2022-11-03 14:24:22 +01:00
First-time contributor

So we don't have to download files multiple times, we want to cache files and use the MD5 checksum to check whether a file is already in the cache.

fixes: https://gitea.opensuse.org/importers/git-importer/issues/2

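A minimal sketch of the idea, assuming a cache keyed by each file's MD5 (the `fetch` function and the `~/.cache/git-importer` default are illustrative, not the exact code in this PR):

```python
import hashlib
import shutil
from pathlib import Path

CACHE = Path("~/.cache/git-importer").expanduser()

def fetch(name, md5, target, download):
    """Reuse a cached copy when the MD5 matches, otherwise download and cache it."""
    cached = CACHE / md5[:3] / f"{md5[3:]}-{name}"
    if cached.exists():
        # Cache hit: no download needed
        shutil.copy(cached, target)
        return
    data = download()  # e.g. the raw bytes fetched from OBS
    if hashlib.md5(data).hexdigest() != md5:
        raise RuntimeError(f"checksum mismatch for {name}")
    target.write_bytes(data)
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(target, cached)
```
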
Ghost added 1 commit 2022-11-02 17:04:58 +01:00
to prevent having to download files multiple times
aplanas reviewed 2022-11-02 17:26:45 +01:00
lib/obs.py Outdated
@@ -165,0 +166,4 @@
if not self.in_cache(name, dirpath, file_md5):
with (dirpath / name).open("wb") as f:
f.write(self._download(project, package, name, revision).read())
shutil.copy(dirpath / name, cached_file)
Owner

Try first a hardlink to save space and speed up the cache:

try:
    cached_file.hardlink_to(dirpath / name)
except OSError:
    # Maybe a different mount point
    # TODO use bindmount for files?
    shutil.copy(dirpath / name, cached_file)

You can use the same approach later

Owner

Uhm, there is the risk of changing the cached version after the next update. Maybe it is better to unlink the file first if present (IIRC there is a parameter to suppress the error if the file does not exist, removing the need to call .exists() before the unlink)

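A sketch of the unlink-first idea on top of the snippet above; the parameter referred to is presumably `Path.unlink(missing_ok=True)` (Python 3.8+), and `shutil` plus the surrounding variables are assumed to be available as in the diff:

```python
target = dirpath / name
# Drop the working-copy file first so a later change to it cannot leak into the cache
target.unlink(missing_ok=True)

if self.in_cache(name, dirpath, file_md5):
    try:
        # Link the working-copy file to the cached object (Python 3.10+)
        target.hardlink_to(cached_file)
    except OSError:
        # Maybe a different mount point: fall back to a plain copy
        shutil.copy(cached_file, target)
else:
    with target.open("wb") as f:
        f.write(self._download(project, package, name, revision).read())
    shutil.copy(target, cached_file)
```
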
Author
First-time contributor

So we want to unlink the normal file (not the file in the cache) every time at the start, and then download or link it again depending on whether it is already in the cache?

Owner

I think so, right? If we unlink, we keep the cached version (copy or link) without the risk of changing its content. What do you think?

Author
First-time contributor

I agree it's the better solution; I just wanted to make sure I understood your logic correctly

Author
First-time contributor

If you are afraid of files getting corrupted, why hardlink at all?

Author
First-time contributor

Do we know how git handles files in its working copy when you e.g. checkout a different branch?

Owner

Can copying 1.3M files be a bit expensive? (I did not measure it, but my crappy HD is killing me every morning)

Owner

> Do we know how git handles files in its working copy when you e.g. checkout a different branch?

When possible, hardlinks: https://git-scm.com/docs/git-clone/

Author
First-time contributor

It only talks about objects cloned from a local repository. And 1.3M files is for all the repos - and right now I'm downloading them again and again from OBS, which is actually worse :)

But I asked @nkrapp to check how git checkout treats files in the WC and then we can decide.

Author
First-time contributor

The git documentation is very ambiguous here and doesn't say outright how checkout treats files, but they use the wording `updating the index and the files in the working tree` (https://git-scm.com/docs/git-checkout/en), so I'd expect it to actually edit the files

Owner

It can be fun to check if this is true: create a repo, commit a file, create a branch, update the file in the branch, and create a hard link outside of the repo. If a checkout into master also updates the external link, then we have only one option (do the copy and forget)

Author
First-time contributor

Just tested it, and changing a file in a branch also updated the hardlinked file outside of the repo. I guess I'll leave the code as it was then

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -180,2 +187,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
Owner

as explained later, maybe `dirpath` can be removed

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -181,1 +188,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
cache = dirpath.joinpath(".cache/")
Owner

Is there a risk of committing this into git? Maybe use `cache = Path("~/.cache/git-import/").expanduser()`

Author
First-time contributor

Hmm, I would prefer this to be on the same file system as the repos folder to avoid the copying. And the repos folder is a subdir, but without a dot. So make it a command-line parameter so I can put it in $PWD. But the default should be what Alberto suggested.

Author
First-time contributor

So the path should be `~/git-importer/repos/.cache`?

Author
First-time contributor

That would work for me personally, but I'd rather do the XDG default that Alberto outlined and allow me to configure it with a command line option.

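A minimal sketch of that combination, with Alberto's default and a command-line override (the `--cache-dir` flag name is hypothetical):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cache-dir",
    type=Path,
    default=Path("~/.cache/git-import/").expanduser(),
    help="directory where downloaded OBS files are cached",
)
args = parser.parse_args()

# e.g. pass --cache-dir $PWD/repos/.cache to keep the cache on the same file system
cachedir = args.cache_dir.expanduser()
cachedir.mkdir(parents=True, exist_ok=True)
```
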
Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +191,4 @@
cache = dirpath.joinpath(".cache/")
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
Owner

I propose to use the same model as git (also drop `joinpath`): `filepath = cache / md5[:2] / md5[2:]`

Author
First-time contributor

As a datapoint: we're talking about 1.3M files - that would still be ~5000 files per directory. The git objects are packed for a reason - there are not supposed to be 1.3M of them.

Author
First-time contributor

This implies of course we do the .cache independent of the repo path - which I agree is a bad idea

Author
First-time contributor
And the .cache is actually committed: https://gitea.opensuse.org/coolo/rpm/src/branch/factory/.cache
Owner

> As a datapoint: we're talking about 1.3M files - that would still be ~5000 files per directory. The git objects are packed for a reason - there are not supposed to be 1.3M of them.

In that case `filepath = cache / md5[:3] / md5[3:]` will create 4096 entries in the first level and 300 files in each one.

Author
First-time contributor

That sounds like a good compromise

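A sketch of the agreed layout, shaped like the `_path_from_md5` helper in this PR; `mkdir(parents=True, exist_ok=True)` also makes the separate existence check in the earlier diff unnecessary:

```python
def _path_from_md5(self, name, cachedir, md5):
    # 16**3 = 4096 first-level directories, so roughly 300 files each for 1.3M objects
    filepath = cachedir / md5[:3]
    filepath.mkdir(parents=True, exist_ok=True)
    return filepath / f"{md5[3:]}-{name}"
```
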
Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +192,4 @@
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
filepath.mkdir(parents=True, exist_ok=True)
Owner

you create the directories here, so you can skip lines 192 and 193

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +196,4 @@
return filepath.joinpath(f"{md5[9:]}-{name}")
def in_cache(self, name, dirpath, md5):
if self._path_from_md5(name, dirpath, md5).is_file():
Owner

`return self._path_from_md5(name, dirpath, md5).exists()`

Ghost marked this conversation as resolved
Author
First-time contributor

Please move the md5 validation from git_exporter.py into the part that does the actual download or copying. Validating a hardlink won't be necessary

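A sketch of what that move could look like (the `_download_validated` name is hypothetical; `self._download` is the existing helper from the diff):

```python
import hashlib

def _download_validated(self, project, package, name, revision, expected_md5):
    data = self._download(project, package, name, revision).read()
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise RuntimeError(f"MD5 mismatch for {name} in {project}/{package}")
    return data
```
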
Ghost added 1 commit 2022-11-03 13:21:45 +01:00
Ghost added 4 commits 2022-11-03 13:51:14 +01:00
Owner

@nkrapp it seems that you introduced some extra commits? Can you rebase?

Author
First-time contributor

I tried to resolve the merge conflicts but it seems I just made it more complicated

Owner

There are many ways to fix this issue. For now try this one:

  1. `git rebase -i`
     Very carefully, drop the 4 commits that you see here that are from coolo
  2. `:wq`
  3. `git fetch --all`
  4. `git rebase origin/master`
     There should not be any conflict now
  5. `git push -f`
Ghost force-pushed file-cache from cd30c3381f to 639096b548 2022-11-03 14:14:43 +01:00 Compare
aplanas approved these changes 2022-11-03 14:18:16 +01:00
aplanas left a comment
Owner

One nit in comments, but LGTM

lib/obs.py Outdated
@@ -182,0 +196,4 @@
def _path_from_md5(self, name, cachedir, md5):
filepath = cachedir / md5[:3]
cached_file = f"{md5[3:]}-{name}"
Owner

I liked it more when it did not contain the `{name}`: that makes it a pure object access. Now if two files have the same content but different names (LICENSE / license, README / README.txt), it will create two cache entries.

Author
First-time contributor

I added this mainly so I could differentiate between the files, but I know this isn't really needed outside of tests, so I don't mind taking it out

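For reference, a sketch of the name-free variant discussed here (illustrative only):

```python
def _path_from_md5(self, cachedir, md5):
    # Pure content addressing: identical content maps to a single cache entry,
    # regardless of the file name it has in any package
    filepath = cachedir / md5[:3]
    filepath.mkdir(parents=True, exist_ok=True)
    return filepath / md5[3:]
```
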
Ghost marked this conversation as resolved
Ghost added 1 commit 2022-11-03 14:23:02 +01:00
aplanas approved these changes 2022-11-03 14:23:45 +01:00
Ghost merged commit 6dd3cf3eba into main 2022-11-03 14:24:22 +01:00
Ghost deleted branch file-cache 2022-11-03 14:24:22 +01:00