implement file caching #11

Merged
Ghost merged 3 commits from file-cache into main 2022-11-03 14:24:22 +01:00
First-time contributor

So we don't have to download files multiple times, we want to cache files and use the MD5 checksum to check whether a file is already in the cache.

fixes: https://gitea.opensuse.org/importers/git-importer/issues/2

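A minimal sketch of the idea, assuming a cache keyed by each file's MD5 (the `fetch` function and the `~/.cache/git-importer` default are illustrative, not the exact code in this PR):

```python
import hashlib
import shutil
from pathlib import Path

CACHE = Path("~/.cache/git-importer").expanduser()

def fetch(name, md5, target, download):
    """Reuse a cached copy when the MD5 matches, otherwise download and cache it."""
    cached = CACHE / md5[:3] / f"{md5[3:]}-{name}"
    if cached.exists():
        # Cache hit: no download needed
        shutil.copy(cached, target)
        return
    data = download()  # e.g. the raw bytes fetched from OBS
    if hashlib.md5(data).hexdigest() != md5:
        raise RuntimeError(f"checksum mismatch for {name}")
    target.write_bytes(data)
    cached.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy(target, cached)
```
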
Ghost added 1 commit 2022-11-02 17:04:58 +01:00
to prevent having to download files multiple times
aplanas reviewed 2022-11-02 17:26:45 +01:00
lib/obs.py Outdated
@@ -165,0 +166,4 @@
if not self.in_cache(name, dirpath, file_md5):
with (dirpath / name).open("wb") as f:
f.write(self._download(project, package, name, revision).read())
shutil.copy(dirpath / name, cached_file)
Owner

Try first a hardlink to save space and speed up the cache:

try:
    cached_file.hardlink_to(dirpath / name)
except OSError:
    # Maybe a different mount point
    # TODO use bindmount for files?
    shutil.copy(dirpath / name, cached_file)

You can use the same approach later

Owner

Uhm, there is the risk of changing the cached version after the next update. Maybe it is better to unlink the file first if present (IIRC there is a parameter to suppress the error if the file does not exist, removing the need to call .exists() before the unlink)

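A sketch of the unlink-first idea on top of the snippet above; the parameter referred to is presumably `Path.unlink(missing_ok=True)` (Python 3.8+), and `shutil` plus the surrounding variables are assumed to be available as in the diff:

```python
target = dirpath / name
# Drop the working-copy file first so a later change to it cannot leak into the cache
target.unlink(missing_ok=True)

if self.in_cache(name, dirpath, file_md5):
    try:
        # Link the working-copy file to the cached object (Python 3.10+)
        target.hardlink_to(cached_file)
    except OSError:
        # Maybe a different mount point: fall back to a plain copy
        shutil.copy(cached_file, target)
else:
    with target.open("wb") as f:
        f.write(self._download(project, package, name, revision).read())
    shutil.copy(target, cached_file)
```
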
Author
First-time contributor

So we want to unlink the normal file (not the file in the cache) every time at the start, and then download or link it again depending on whether it is already in the cache?

Owner

I think so, right? If we unlink, we keep the cached version (copy or link) without the risk of changing its content. What do you think?

Author
First-time contributor

I agree it's the better solution; I just wanted to make sure I understood your logic correctly

Author
First-time contributor

If you are afraid of files getting corrupted, why hardlink at all?

Author
First-time contributor

Do we know how git handles files in its working copy when you e.g. checkout a different branch?

Owner

Can copying 1.3M files be a bit expensive? (I did not measure it, but my crappy HD is killing me every morning)

Owner

> Do we know how git handles files in its working copy when you e.g. checkout a different branch?

When possible, hardlinks: https://git-scm.com/docs/git-clone/

Author
First-time contributor

It only talks about objects cloned from a local repository. And 1.3M files is for all the repos - and right now I'm downloading them again and again from OBS, which is actually worse :)

But I asked @nkrapp to check how git checkout treats files in the WC and then we can decide.

Author
First-time contributor

The git documentation is very ambiguous here and doesn't say outright how checkout treats files, but they use the wording `updating the index and the files in the working tree` (https://git-scm.com/docs/git-checkout/en), so I'd expect it to actually edit the files

Owner

It can be fun to check if this is true: create a repo, commit a file, create a branch, update the file in the branch, and create a hard link outside of the repo. If a checkout into master also updates the external link, then we have only one option (do the copy and forget)

Author
First-time contributor

Just tested it, and changing a file in a branch also updated the hardlinked file outside of the repo. I guess I'll leave the code as it was then

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -180,2 +187,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
Owner

as explained later, maybe `dirpath` can be removed

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -181,1 +188,4 @@
return root
def _path_from_md5(self, name, dirpath, md5):
cache = dirpath.joinpath(".cache/")
Owner

Is there a risk of committing this into git? Maybe use `cache = Path("~/.cache/git-import/").expanduser()`

Author
First-time contributor

Hmm, I would prefer this to be on the same file system as the repos folder to avoid the copying. And the repos folder is a subdir, but without a dot. So make it a command-line parameter so I can put it in $PWD. But the default should be what Alberto suggested.

Author
First-time contributor

So the path should be `~/git-importer/repos/.cache`?

Author
First-time contributor

That would work for me personally, but I'd rather do the XDG default that Alberto outlined and allow me to configure it with a command line option.

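A minimal sketch of that combination, with Alberto's default and a command-line override (the `--cache-dir` flag name is hypothetical):

```python
import argparse
from pathlib import Path

parser = argparse.ArgumentParser()
parser.add_argument(
    "--cache-dir",
    type=Path,
    default=Path("~/.cache/git-import/").expanduser(),
    help="directory where downloaded OBS files are cached",
)
args = parser.parse_args()

# e.g. pass --cache-dir $PWD/repos/.cache to keep the cache on the same file system
cachedir = args.cache_dir.expanduser()
cachedir.mkdir(parents=True, exist_ok=True)
```
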
Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +191,4 @@
cache = dirpath.joinpath(".cache/")
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
Owner

I propose to use the same model as git (also drop `joinpath`): `filepath = cache / md5[:2] / md5[2:]`

Author
First-time contributor

As a datapoint: we're talking about 1.3M files - that would still be ~5000 files per directory. The git objects are packed for a reason - there are not supposed to be 1.3M of them.

Author
First-time contributor

This implies of course we do the .cache independent of the repo path - which I agree is a bad idea

Author
First-time contributor
And the .cache is actually committed: https://gitea.opensuse.org/coolo/rpm/src/branch/factory/.cache
Owner

> As a datapoint: we're talking about 1.3M files - that would still be ~5000 files per directory. The git objects are packed for a reason - there are not supposed to be 1.3M of them.

In that case `filepath = cache / md5[:3] / md5[3:]` will create 4096 entries in the first level and 300 files in each one.

Author
First-time contributor

That sounds like a good compromise

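A sketch of the agreed layout, shaped like the `_path_from_md5` helper in this PR; `mkdir(parents=True, exist_ok=True)` also makes the separate existence check in the earlier diff unnecessary:

```python
def _path_from_md5(self, name, cachedir, md5):
    # 16**3 = 4096 first-level directories, so roughly 300 files each for 1.3M objects
    filepath = cachedir / md5[:3]
    filepath.mkdir(parents=True, exist_ok=True)
    return filepath / f"{md5[3:]}-{name}"
```
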
Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +192,4 @@
if not Path(cache).exists():
cache.mkdir()
filepath = cache.joinpath(f"{md5[0:3]}/{md5[3:6]}/{md5[6:9]}/")
filepath.mkdir(parents=True, exist_ok=True)
Owner

you create the directories here, so you can skip lines 192 and 193

Ghost marked this conversation as resolved
lib/obs.py Outdated
@@ -182,0 +196,4 @@
return filepath.joinpath(f"{md5[9:]}-{name}")
def in_cache(self, name, dirpath, md5):
if self._path_from_md5(name, dirpath, md5).is_file():
Owner

`return self._path_from_md5(name, dirpath, md5).exists()`

Ghost marked this conversation as resolved
Author
First-time contributor

Please move the md5 validation from git_exporter.py into the part that does the actual download or copying. Validating a hardlink won't be necessary

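A sketch of what that move could look like (the `_download_validated` name is hypothetical; `self._download` is the existing helper from the diff):

```python
import hashlib

def _download_validated(self, project, package, name, revision, expected_md5):
    data = self._download(project, package, name, revision).read()
    if hashlib.md5(data).hexdigest() != expected_md5:
        raise RuntimeError(f"MD5 mismatch for {name} in {project}/{package}")
    return data
```
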
Ghost added 1 commit 2022-11-03 13:21:45 +01:00
Ghost added 4 commits 2022-11-03 13:51:14 +01:00
Owner

@nkrapp it seems that you introduced some extra commits? Can you rebase?

Author
First-time contributor

I tried to resolve the merge conflicts but it seems I just made it more complicated

Owner

There are many ways to fix this issue. For now try this one:

  1. `git rebase -i`
     Very carefully, drop the 4 commits that you see here that are from coolo
  2. `:wq`
  3. `git fetch --all`
  4. `git rebase origin/master`
     There should not be any conflict now
  5. `git push -f`
Ghost force-pushed file-cache from cd30c3381f to 639096b548 2022-11-03 14:14:43 +01:00 Compare
aplanas approved these changes 2022-11-03 14:18:16 +01:00
aplanas left a comment
Owner

One nit in comments, but LGTM

lib/obs.py Outdated
@@ -182,0 +196,4 @@
def _path_from_md5(self, name, cachedir, md5):
filepath = cachedir / md5[:3]
cached_file = f"{md5[3:]}-{name}"
Owner

I liked it more when it did not contain the `{name}`: that makes it a pure object access. Now if two files have the same content but different names (LICENSE / license, README / README.txt), it will create two cache entries.

Author
First-time contributor

I added this mainly so I could differentiate between the files, but I know this isn't really needed outside of tests, so I don't mind taking it out

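For reference, a sketch of the name-free variant discussed here (illustrative only):

```python
def _path_from_md5(self, cachedir, md5):
    # Pure content addressing: identical content maps to a single cache entry,
    # regardless of the file name it has in any package
    filepath = cachedir / md5[:3]
    filepath.mkdir(parents=True, exist_ok=True)
    return filepath / md5[3:]
```
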
Ghost marked this conversation as resolved
Ghost added 1 commit 2022-11-03 14:23:02 +01:00
aplanas approved these changes 2022-11-03 14:23:45 +01:00
Ghost merged commit 6dd3cf3eba into main 2022-11-03 14:24:22 +01:00
Ghost deleted branch file-cache 2022-11-03 14:24:22 +01:00