Hey, I'm trying to make a plugin to include dvc f...
# plugins
c
Hey, I'm trying to make a plugin to include DVC files. I already added a rule to automatically generate a target for each .dvc file found:
class DvcTarget(Target):
    alias = "dvc_files"
    core_fields = (
        *COMMON_TARGET_FIELDS,
        SingleSourceField,
    )


@dataclass(frozen=True)
class PutativeDvcTargetsRequest(PutativeTargetsRequest):
    pass


@rule(level=LogLevel.DEBUG, desc="Determine candidate dvc targets to create")
async def find_putative_targets(
    req: PutativeDvcTargetsRequest,
    all_owned_sources: AllOwnedSources,
    # python_setup: PythonSetup,
) -> PutativeTargets:
    pts: List[PutativeTarget] = []
    all_dvc_files_globs: PathGlobs = req.path_globs("*.dvc")
    all_dvc_files = await Get(Paths, PathGlobs, all_dvc_files_globs)
    new_dvc_files = set(all_dvc_files.files) - set(all_owned_sources)

    for file in new_dvc_files:
        logger.info(f"Found dvc file {file}")
        dirname = os.path.dirname(file)
        file_name = os.path.basename(file)
        file_base = os.path.splitext(file_name)[0]
        pts.append(
            PutativeTarget.for_target_type(
                DvcTarget,
                dirname,
                name=file_base,
                triggering_sources=[file],
                kwargs={"source": file_name},
            )
        )

    return PutativeTargets(pts)
I just don't know which classes I have to use to tell it to include the files referenced by the .dvc file. My goal is to be able to add these dvc targets as dependencies of Python scripts. This is my current attempt, but I think I'm using the wrong components:
class GenerateDvcFileRequest(GenerateSourcesRequest):
    input = SingleSourceField
    output = FileSourceField

@rule
async def generate_dvc_file(
    request: GenerateDvcFileRequest,
) -> GeneratedSources:
    sources_requests = await Get(
        DigestContents, Digest, request.protocol_sources.digest
    )
    assert len(sources_requests) == 1
    sources_request = sources_requests[0]
    source_yaml = yaml.load(sources_request.content, Loader=yaml.FullLoader)
    if not isinstance(source_yaml, dict):
        raise ValueError(f"Invalid yaml file")
    logger.info(f"source_yaml: {source_yaml}")
    wdir = source_yaml.get("wdir", ".")
    outs = source_yaml.get("outs", [])
    files = [o.get("path") for o in outs]
    file_digests = await MultiGet(
        Get(Digest, PathGlobs, PathGlobs([f for f in files])) for f in files
    )
    snapshot = await Get(Snapshot, Digest, request.protocol_sources.digest)
    return GeneratedSources(snapshot)
c
Your DvcTarget class needs to derive from ResourceTarget (https://github.com/pantsbuild/pants/blob/d955f0b6c4914f367d54a50fe3a7270f39da84e8/src/python/pants/core/target_types.py#L552) in order for it to be included as a Python resource (to behave as described at https://www.pantsbuild.org/stable/docs/using-pants/assets-and-archives#resources). If you prefer them to be treated as a file target, adjust the base class accordingly. 😉
c
Hey, thank you for the help! Our DVC setup stores model weights, so it would probably be better to use the following target:
class DvcTarget(FileTarget):
    alias = "dvc_files"
So the goal would then be to look at the content of the .dvc file and include all the files that are loaded by it. This means reading the YAML, looking at wdir and outs: path to find all the files/folders that belong to it, globbing them, and adding them to the Snapshot. I just don't seem to be able to do this part. I'm completely new to pantsbuild and it's not very clear to me which Request I should use for this. I tried the following, but it doesn't seem to be called even when I add the dvc_files target as an explicit dependency to a target and run it:
class DvcGeneratorTarget(FilesGeneratorTarget): ...


@rule
async def generate_dvc_file(
    request: FilesGeneratorTarget,
) -> GeneratedSources:
    raise NotImplementedError()
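The steps described above (read the YAML, take wdir and each outs path, then glob) can be sketched in plain Python, independent of Pants. This is a minimal sketch, not DVC's own logic: resolve_dvc_outputs and outputs_to_globs are hypothetical helper names, the input is assumed to be the mapping already parsed from the .dvc file (for example with yaml.safe_load), and wdir is assumed to be relative to the directory containing the .dvc file.

```python
import posixpath
from typing import Any, Dict, List


def resolve_dvc_outputs(dvc_file: str, parsed: Dict[str, Any]) -> List[str]:
    """Return normalized paths for every `outs` entry of a parsed .dvc file.

    `dvc_file` is the path of the .dvc file relative to the build root, so
    the returned paths are relative to the build root as well.
    """
    if not isinstance(parsed, dict):
        raise ValueError(f"Invalid .dvc file {dvc_file}: expected a YAML mapping")
    # Assumption: `wdir` is relative to the directory holding the .dvc file
    # and defaults to that directory.
    base = posixpath.join(posixpath.dirname(dvc_file), parsed.get("wdir", "."))
    paths: List[str] = []
    for out in parsed.get("outs") or []:
        path = out.get("path")
        if path:
            paths.append(posixpath.normpath(posixpath.join(base, path)))
    return paths


def outputs_to_globs(paths: List[str]) -> List[str]:
    """Emit two globs per output so both a file and a directory would match."""
    globs: List[str] = []
    for path in paths:
        globs.extend([path, f"{path}/**"])
    return globs
```

For models/weights.dvc with wdir set to .. and an out at data/raw, this yields data/raw, and then the globs data/raw and data/raw/**, which could be handed to PathGlobs.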
g
So just at a glance, your rule has to take a GenerateSourcesRequest derivative, which declares the to/from source types of your rule: https://github.com/tgolsson/pants-backends/blob/1c49594919b62ce418d864564d0a84e31c927dfb/pants-plugins/kustomize/pants_backend_kustomize/codegen.py#L49-L59
I know very little about DVC, but it can be helpful (sometimes...) to implement a Package rule for your backend, and then implement your code generator in terms of that. Whether that makes sense for DVC I can't tell, but if you're thinking of generating a resource it could be good. That would allow you to run pants package foo:bar independently, and your code generator is then just "build packages and forward". There's also pants export-codegen, which can help with debugging.
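As a rough illustration of the GenerateSourcesRequest advice, the sketch below wires a dedicated source field to a codegen rule. It is a sketch under assumptions, not a tested backend: DvcSourceField, GenerateFilesFromDvcRequest and generate_files_from_dvc are hypothetical names, PyYAML is assumed to be available to the plugin, and wdir is assumed to be relative to the .dvc file's directory. Two things that likely explain why the earlier rule was never called: the rule parameter must be the request subclass (not a target class), and the request has to be registered with a UnionRule. Also, if the target's source field is already a FileSourceField (as with FileTarget), Pants probably has no reason to request generation, hence the separate field here.

```python
# Sketch only: class and rule names below are hypothetical.
import posixpath

import yaml  # PyYAML, as in the original attempt

from pants.core.target_types import FileSourceField
from pants.engine.fs import Digest, DigestContents, PathGlobs, Snapshot
from pants.engine.rules import Get, collect_rules, rule
from pants.engine.target import (
    COMMON_TARGET_FIELDS,
    GeneratedSources,
    GenerateSourcesRequest,
    SingleSourceField,
    Target,
)
from pants.engine.unions import UnionRule


class DvcSourceField(SingleSourceField):
    # Dedicated field type so only dvc_files targets trigger this generator.
    expected_file_extensions = (".dvc",)


class DvcTarget(Target):
    alias = "dvc_files"
    core_fields = (*COMMON_TARGET_FIELDS, DvcSourceField)


class GenerateFilesFromDvcRequest(GenerateSourcesRequest):
    input = DvcSourceField    # what the rule consumes
    output = FileSourceField  # what consumers of `file` sources will receive


@rule
async def generate_files_from_dvc(
    request: GenerateFilesFromDvcRequest,
) -> GeneratedSources:
    contents = await Get(DigestContents, Digest, request.protocol_sources.digest)
    dvc_file = contents[0]
    parsed = yaml.safe_load(dvc_file.content) or {}
    # Assumption: `wdir` is relative to the directory of the .dvc file.
    base = posixpath.join(posixpath.dirname(dvc_file.path), parsed.get("wdir", "."))
    globs = []
    for out in parsed.get("outs", []):
        path = posixpath.normpath(posixpath.join(base, out["path"]))
        # Match the output whether it is a single file or a directory.
        globs.extend([path, f"{path}/**"])
    snapshot = await Get(Snapshot, PathGlobs, PathGlobs(globs))
    return GeneratedSources(snapshot)


def rules():
    return [
        *collect_rules(),
        # Without this registration the engine never dispatches to the rule.
        UnionRule(GenerateSourcesRequest, GenerateFilesFromDvcRequest),
    ]
```

One thing worth checking if the snapshot comes back empty: DVC usually adds tracked outputs to .gitignore, and Pants ignores gitignored paths by default (the pants_ignore_use_gitignore global option), so the globs may match nothing until that is adjusted.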