# development
h
any chance I could get some more eyes on https://github.com/pantsbuild/pants/pull/8973 ? particularly @witty-crayon-22786 or @average-vr-56795
w
hypothetically, HttpRequester is not side-effecting...?
just potentially not cacheable?
h
`HttpRequester` is definitely side-effecting if we make it the standard way to do all HTTP operations (including POSTs), which is what I had in mind with it
I only added GET functionality in this PR since it's what toolchain needs first, but, assuming we go with this design for making HTTP requests in general, we'd definitely want to add side-effecting functionality to it, ideally soon
maybe I should've documented that more clearly in the PR or docstrings
w
are you envisioning that we would have a separate non-sideeffecting HTTP api that could be used deeper in the graph?
h
no, I'm envisioning that we wouldn't do this; all HTTP requests would happen through `HttpRequester`, other than the `UrlToFetch` abstraction we already have
I don't think we can build such a separate non-sideeffecting HTTP api safely right now anyway, since that depends on #6598
w
...given that we are 100% certain that we need to implement #6598, does that change your answer?
h
I'm honestly not sure. part of me genuinely likes the idea that we keep the class of `@side_effecting` types, and keep the constraint that `@side_effecting` types can only be requested in a `@goal_rule`. this constraint forces rule implementors to "push out" operations that are conceptually side-effecting to `@goal_rule`s. and I think that the other option, which is to just let ordinary `await Get[HttpResponse](HttpRequest, req)` calls work (while dirtying a bunch of the node graph, or sometimes dirtying it and sometimes not, depending on the type and options of the `HttpRequest`), makes it harder to think about when a rule will and will not be cached, which is undesirable
on the other hand, it would be convenient for rule authors to be able to just write `await Get[HttpResponse](HttpRequest, req)` and have the engine take care of invalidating only the things that need to be invalidated when the result of this HTTP request changes
do you agree that this tension between these two ways of implementing HTTP (and more broadly any type of side-effecting rule, including EPR if we implement the `foreground` flag) within pants exists?
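(a minimal, self-contained sketch of the constraint being debated here — `@side_effecting` types only being requestable from a `@goal_rule`. all the names and the `check_request` helper are illustrative stand-ins, not the real pants engine API:)

```python
# Hypothetical model of the constraint under discussion: types marked
# @side_effecting may only be requested from a @goal_rule. These names
# mirror the conversation but are NOT the real pants implementation.
SIDE_EFFECTING_TYPES = set()


def side_effecting(cls):
    """Mark a type as side-effecting."""
    SIDE_EFFECTING_TYPES.add(cls)
    return cls


def check_request(requested_type, *, in_goal_rule):
    """Reject requests for side-effecting types made outside a @goal_rule."""
    if requested_type in SIDE_EFFECTING_TYPES and not in_goal_rule:
        raise TypeError(
            f"{requested_type.__name__} may only be requested in a @goal_rule"
        )


@side_effecting
class HttpRequester:
    """Stand-in for the side-effecting requester type from the PR."""


class HttpResponse:
    """Stand-in for an ordinary, cacheable result type."""
```

under this model, the engine would run `check_request` at rule-compile time, so an ordinary cacheable rule could never obtain an `HttpRequester`.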
w
i agree that side-effecting should be at the root... that makes sense. the part that doesn't make sense is: not all HTTP requests are sideeffecting.
foreground is uncacheable, but it is still "sandboxed" to the same extent that all process invokes are
...so it is not "sideeffecting", any more than any process is.
h
I think we're operating from different perspectives of what side-effecting means
I've been thinking about side-effecting as meaning anything that causes pants to have to wait for external input at the time that the user runs a command - so, waiting on an HTTP request counts, as does running a --debug process (because pants has to wait for the user to type commands live into the terminal)
w
i think that sideeffecting has a pretty distinct meaning: "making (untracked?) changes to the outside world". making a POST request to a web service, or deleting/editing a file in your repository (or anywhere on your system, for that matter) would be sideeffects
h
and I think you're calling a subset of those things "uncacheable" but not "side-effecting", and arguing that it's okay if uncacheable but not side-effecting effects happen in the engine, dirtying nodes in the way that 6598 will implement - and that only things that are truly side effecting from the perspective of affecting the outside world at runtime actually need to be pushed into `@goal_rule`s
w
i don't think that user input into a process is a sideeffect, any more than changing your CLI args is a sideeffect.
user input does mean that you would want to re-run if the inputs might have changed... and that requires it not to be cached, because we can't record the user inputs and replay them (and wouldn't want to if we could)
h
ah, and I've been thinking of user input to a process as being a side-effect, because in practice doing this requires guaranteeing that we grab the terminal
w
another heuristic to think about sideeffecting things is "is it safe to retry automatically without any configuration?" (although this is partial: there are sideeffecting things that are safe to retry)
we should absolutely, 100% automatically retry processes and UrlToFetch requests
h
I don't actually think this is safe to do with ordinary HTTP GET requests, because of timeouts
if a server is down, we don't want pants to sit there spinning making requests at it that will never succeed
so in this sense, having to deal with timeouts "re-introduces" side-effecting semantics to HTTP GET, even though GETs are, on their own terms, supposed to be idempotent
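(the bounded-retry behavior being argued about can be sketched generically. this is an illustrative helper, not pants code; the names and parameters are made up:)

```python
import time


def retry(fn, attempts=3, base_delay=0.0):
    """Call fn(), retrying a bounded number of times with exponential
    backoff. Because the attempt count is capped, a downed server makes
    the build fail fast instead of spinning on requests forever."""
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise  # out of attempts: surface the last failure
            time.sleep(base_delay * (2 ** i))
```

wrapping an idempotent GET in something like this is safe precisely because GET is idempotent; the cap on `attempts` is what addresses the "server is down" worry above.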
w
i'm not suggesting infinite retry.
retrying and failure do not introduce "side-effects". not by the definition above.
being non-deterministic is not a side-effect.
e
I'm having trouble tracking all this. To put a finer point on things, how is HttpRequester different from InteractiveProcessRunner? Is there a specific difference that causes pause with the HttpRequester PR or is it rather a stop-the-bleeding concern, ie: we shouldn't have been doing it this way for InteractiveProcessRunner either. If the latter, does our repeated quoting of "private APIs still" underwrite moving forward?
h
@enough-analyst-54434 I don't think it is different. I mentioned on https://github.com/pantsbuild/pants/pull/8974#issuecomment-574924016 that I think if we go with stu's conception of what is and is not side-effecting, we probably don't want IPR as a distinct thing from EPR
w
HttpRequester and IPR are similar. in the IPR case we have an alternative API that works for the `test --debug` case, but not for the `run` case. similarly, in the HttpRequester case, there is likely a non-sideeffecting API that could be used deeper in the graph to make GET requests.
e
That suggests moving forward then with HttpRequester and circling back later if unification / better ideas eventually come along.
w
maybe. but there are also alternatives that have been discussed a few times, including writing whatever logic is hitting the network as a process
for example: would i currently suggest running PEX in process? no: i would suggest subprocessing it
depending what kind of logic we're trying to write, the answer might be "it should be in another process".
if this is blocking something important for you, and doing this in process is truly the right answer, then i won't block. i just don't have perspective on the backstory and motivations.
e
> depending what kind of logic we're trying to write, the answer might be "it should be in another process".
(I don't understand slack) ... It feels like we've been unable to pin this down for several months or more and this seems like a basic thing (foundational) that we need to have ready answers for that don't require consulting oracles. I'm guessing everyone would like to have this nailed down but it requires more time than anyone is willing to invest.
w
basic and foundational are two very different things.
i'm not blocking here. rather than an oracle, you should consider me to be someone advising caution in fundamental apis
e
They are not. To a rule writer, basic may happen to be foundational. They don't really care about the distinction.
w
the foundation decides what should be basic or not basic... easy vs hard vs impossible
writing a complete network aware system in the engine is an existential question: is that what it is for, or not?
e
The oracle bit is what it feels like though, I think the reason for that is we don't have a high enough ratio of folks all understanding the same thing. We need a higher ratio to make fluid forward progress. I'm not sure how to get there.
The engine conversations have often felt painful for a good while.
w
@enough-analyst-54434: i would say that some of the reason this feels oracle like is that i have no idea what the actual usecase is, and need to speak in generalizations. this feels like an example of the kind of trap twitter frequently gets in where we attempt to make a change without motivating it, and you push back wanting to know the usecase
if it would help to meet and discuss the usecase privately, i'm happy to.
e
Makes sense. I'm not exactly sure of the usecase - I'm away too much, but fwict it's calling external HTTP APIs.
I would guess that would be a pretty routine thing for a rule author to need to do? Could be wrong.
w
the only usecase i know of (literally! not because i doubt that you have one!) is to implement a resolver
and we have moved in the direction of having that out of process, because it is language specific
e
I can imagine Medium wanting to post info to datadog. They did that sort of thing back when I worked with them.
w
(ivy, coursier, pex, yarn, go fetch, etc)
e
That's the 1st off the top.
w
https://pantsbuild.slack.com/archives/C0D7TNJHL/p1579649573065700?thread_ts=1579307375.032400&cid=C0D7TNJHL metrics gathering feels inherent to the graph (similar to the workunit stuff) rather than something that you'd put in a rule....
but rather than hypotheticals: @hundreds-breakfast-49010: would it be useful for you to write privately somewhere about what the usecase is...? or for us to meet?
e
And I really don't think this is private is it @hundreds-breakfast-49010?
Thinking broadly though, toolchain will clearly provide some for-pay Pants functionality, no doubt some that forms part of the rule graph. To do so, we'll need to talk to an API to perform some proprietary work. To do so, we'd currently need to write a small CLI to wrap any api call. That would clearly work, it just feels a bit unfriendly imagining a world where various companies provided various for-pay rules. A first class API like EPR seems a bit more friendly. The extreme here is Pants writes a CLI for requests and packages it up as a cross-platform pex. We then have a set of types and rules that expose what looks like an API to make HTTP requests but actually does a UrlToFetch of that generic HTTP client pex and EPRs it.
Maybe that makes sense, but it seems a bit baroque.
h
ah sorry missed this convo, my slack page seems to have lost its internet connection without me noticing it
but yeah what @enough-analyst-54434 said is what toolchain wants, and it's not private at that level of abstraction
I'm trying to get a first-class, generic, HTTP API merged into pants, starting with GET requests, so that toolchain can write pants rules that make requests against toolchain's API and perform different business logic depending on the results of the request
writing an entirely separate script that makes an HTTP request, and running it as an EPR is awkward; and also we don't currently have a way to tell an EPR to stop paying attention to its cached result after a timeout and run again
w
https://pantsbuild.slack.com/archives/C0D7TNJHL/p1579658717067900?thread_ts=1579307375.032400&cid=C0D7TNJHL this is evident from what we have discussed, but doesn't do much to clarify. i really think we need to talk about the concrete usecase.
e
https://pantsbuild.slack.com/archives/C0D7TNJHL/p1579658717067900?thread_ts=1579307375.032400&cid=C0D7TNJHL is pretty concrete. Let's say the API takes a list of python requirements and produces a list of distribution urls. The API is protected by authentication. We want to write a rule that uses the API. I can't imagine other relevant details.
a
Do we expect deterministic results from that API?
(Assuming good auth)
e
Yes
a
So for that use-case:
• No observable side effects (but hitting the cache may save some money)
• Deterministic output
• No need for special invalidation of dependees
• Auth should probably be a special concern where it's a param to the rule, but doesn't form part of the caching identity (i.e. multiple auth tokens may be considered equivalent in terms of getting cache hits, but should be considered distinct in terms of: if we got an auth failure, and retry with a new auth token, we actually perform the request)
• We may need to integrate with some kind of non-deterministic auth subsystem to provide that param - ideally we don't need to re-auth every build, but also ideally the graph would have functionality like "If I see an auth failure, I'll re-auth and retry the rule" (the last two are very similar to a concern we have for remote execution auth (and punted on))
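(the "auth is a param but not part of the caching identity" point can be sketched concretely. `ResolveRequest` and `cache_key` are hypothetical names for illustration, not pants types:)

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ResolveRequest:
    """Hypothetical request shape for the resolver use-case above."""
    requirements: Tuple[str, ...]
    auth_token: str  # needed to perform the request, but see cache_key


def cache_key(req: ResolveRequest):
    # The token is deliberately excluded from the key: two requests that
    # differ only in their token are equivalent for cache-hit purposes,
    # even though the token is required to actually hit the API.
    return ("resolve", req.requirements)
```

an auth failure would then bypass the cache and re-run the rule with a fresh token, without changing the key.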
e
And I'd argue that use case is likely basic, to dredge up the contentious framing. If it's difficult or impossible to implement we have a problem to solve.
a
I like the use-case. I’m a little scared that there probably are side-effecting use-cases. But if we’re happy to build a solution which doesn’t solve them, that works. For that case, it sounds like the only thing we’re missing is the way to inject an auth token, and specialising the caching to handle that? Would we be happy specialising the problem specifically to oauth, and providing an oauth intrinsic which the engine treats specially?
w
if daniel's guesses about the usecase are correct, it doesn't seem like the API should be marked sideeffecting, and should thus be usable deeper in the graph
(which would make it much more useful, and which is where i thought Greg's design doc was headed: https://docs.google.com/document/d/1CdlGMXWxdwpGqet7-k6DFIbg90IqR75cFg8SkRu06H8/edit )
a
(But the auth thing is side-effecting, in a really specific weird way, which is why it would be nice to special case that if possible… If we want to generally solve that kind of problem, we do need to solve the whole big scary one)
(Effectively the auth node would be the scary one, not the http node)
w
yea, i think that the auth thing was raised in the context of https://github.com/pantsbuild/pants/issues/8770 ... (see the last comment)? but i feel like auth/caches of that sort should be an intrinsic
@hundreds-breakfast-49010: if the above is true, it's exciting, because the intrinsic can go deeper in the graph, right?
h
what do you mean by "the above"?
> I like the use-case. I’m a little scared that there probably are side-effecting use-cases. But if we’re happy to build a solution which doesn’t solve them, that works.
The side-effecting type `HttpRequester` does solve side-effecting http use cases automatically
w
the last 5 or 6 comments from daniel and i
h
I'm not sure it makes sense to bake oauth or any other type of http authentication into a pants intrinsic
I'm not actually particularly knowledgeable about the details of oauth (or any other auth system), so I'm not sure what it would entail to do so
I think @average-vr-56795’s argument is that, if we had the engine automatically take care of authentication, we'd be able to get non-side-effecting semantics out of GET again and therefore let a GET request intrinsic be okay within the cacheable part of the rule graph
this is basically a question of whether we want pants to assume some details of how auth is going to work for any client of pants, or have pants give rule-authors generic tools to handle their own auth, and I lean towards the latter
w
not the oauth+http part, i think: just the ability to create a tiny persistent cache somehow.
i think that the auth/persistent-cache part is somewhat orthogonal (...well, it's related. but building in support for HTTP auth would definitely jive with intrinsic usage of http.)
the larger question is whether you agree that your concrete usecase does not require sideeffects in http
ie: auth requiring sideeffects is different.
h
within http our concrete usecase doesn't need HTTP sideeffects
so if we're treating auth sideeffects and also timeout/retry sideeffects as different, then yes a GET http request doesn't side-effect
w
timeout/retry are not sideeffects
they're non-determinism.
so yea, i think that that means you're good moving the API deeper.
which, again: i think is really exciting.
h
I'm good moving the GET API deeper, closer to what I had in mind with https://docs.google.com/document/d/1CdlGMXWxdwpGqet7-k6DFIbg90IqR75cFg8SkRu06H8/edit , but that means we have to first solve the problems around non-determinism deep within the rule graph
which is something we want to do anyway
w
we're forced to address the split with IPR and foreground, because some concrete usecases need sideeffects there. but not here with http
yep.
i drafted 6598 on the plane back from oneteam, but haven't tested it yet.
h
okay, so once that's merged, we'll be able to write a HTTP get intrinsic that will handle re-auth/timeouts/HTTP failures in a way that rules don't have direct access to
w
yep!
i'll take a swing either tonight or this weekend.
h
that intrinsic will be marked as `uncacheable` as just a property of that specific intrinsic (other intrinsics are still `cacheable`)
w
yep.
with ... some sort of semantics.
the implementation details of intrinsics are very private, and very malleable.
so replacing the `Node::cacheable` flag with something more nuanced is 100% fine.
h
and it will have to return some type that's generic over success/failure, like `FallibleHttpResponse` - because what non-determinism means is that, with the exact same `HttpGetRequest` type fed to the engine, the intrinsic might return a success or a failure, depending on all the details of timeouts, etc. that the intrinsic will take care of
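(a sketch of what such a success/failure type might look like. the field names and the shape of `FallibleHttpResponse` here are guesses for illustration, not a settled design:)

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FallibleHttpResponse:
    """Hypothetical result type generic over success/failure: the same
    HttpGetRequest may yield either, depending on timeouts etc."""
    status: Optional[int]       # None when the request never completed
    body: bytes = b""
    error: Optional[str] = None  # e.g. "timeout", "connection refused"

    @property
    def succeeded(self) -> bool:
        # Only a completed request with a 2xx status counts as success.
        return (
            self.error is None
            and self.status is not None
            and 200 <= self.status < 300
        )
```

rules consuming this would branch on `succeeded` rather than relying on the engine to raise on failure.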
w
i agree with Daniel's comment on the design doc about that aspect.
h
which comment is that?
h
ah yeah
w
basically: user code should generally not need to deal with an HTTP request failing... although 404 might be a weird special case in this regard.
h
that's something the intrinsic will have to decide
w
i don't think so
h
of course it could decide this based on options in the `HttpGetRequest` type it gets fed
w
...oh. i see what you mean. with regard to whether it retries?
yea, i suppose. that's clear-ish from HTTP semantics though.
(client errors vs server errors)
h
yeah I can see the argument that you shouldn't retry a 4xx error based on http's semantics
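(the "client errors vs server errors" retry heuristic is small enough to sketch. this is an illustrative helper, not pants code; treating 429 as retryable is an assumption based on its "try again later" semantics:)

```python
def should_retry(status: int) -> bool:
    """HTTP semantics: 4xx client errors are deterministic (the same
    request will fail the same way), so retrying is pointless. 5xx
    server errors are often transient, so a bounded retry is reasonable.
    429 (Too Many Requests) is the one 4xx that explicitly means
    'try again later', so we treat it as retryable here."""
    return status == 429 or 500 <= status < 600
```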
the last thing that I'm worried about is building assumptions about auth into the intrinsic's internal retry logic
w
as Daniel mentioned above, adapting UrlToFetch might be an option here (vs having a separate intrinsic). but i don't have a preference either way.
h
like I said, I don't understand all the ways you can auth an HTTP request super well
w
me neither. but that would have needed to be figured out either way.
h
`UrlToFetch` is implemented with an intrinsic today
so the question is, is there some form of http auth such that we're going to regret leaving it to an intrinsic to handle
I know that there are some systems where you make a request to an auth URL, get an auth token that's valid for say 1 hour, and then you craft your subsequent requests to an API with that token
how would a rule author write a rule that could cope with this?
w
see above re: the cache intrinsic.
in short, have a tiny intrinsic that hits a persistent cache (with a configurable ttl if necessary) in the Store.
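(a minimal in-memory sketch of that ttl-cache idea — a real version would persist to the Store, and the class and method names here are made up. `now` is injectable so expiry is testable:)

```python
import time


class TtlCache:
    """Tiny cache whose entries expire after a configurable ttl, the
    shape of thing that could back a hypothetical
    Get[OptionalCacheEntry](CacheKey(..)) intrinsic."""

    def __init__(self):
        self._entries = {}

    def put(self, key, value, ttl, now=None):
        now = time.time() if now is None else now
        self._entries[key] = (value, now + ttl)

    def get(self, key, now=None):
        """Return the cached value, or None if absent or expired."""
        now = time.time() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if now >= expires_at:
            del self._entries[key]  # lazily evict expired entries
            return None
        return value
```

for the one-hour-token case below: an auth rule would `get("token")`, and on a miss (first build, or expiry) re-auth and `put` the fresh token with `ttl=3600`.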
h
what's the API you're conceiving of this intrinsic using?
w
hm? something like `await Get[OptionalCacheEntry](CacheKey(..))`
h
https://gist.github.com/gshuflin/6d7778bad0a4feb1690f34afb7ad5268 <- I'm just sketching out ideas of what it might feel like to actually use the intrinsic API in practice