Closed Bug 1453619 Opened 6 years ago Closed 6 years ago

The telemetry event for a clientID in which a preference experiment unenroll process took place is created after more than 7 days

Categories

(Toolkit :: Telemetry, defect)

60 Branch
defect
Not set
major

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: JuliaC, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: qablocker)

Attachments

(2 files)

[Affected versions]:
- 60.0b11 build2 (20180409184545)

[Affected platforms]:
- Windows 10 x64
- Ubuntu 16.04 x86

[Preconditions]
- You need access to SHIELD Control Panel ( https://normandy-admin.stage.mozaws.net/ )
- Set the 'app.normandy.dev_mode' preference to 'true' to run recipes immediately on startup.
- Set the 'app.normandy.logging.level' preference to '0' to enable more logging.
- Set the 'security.content.signature.root_hash' preference to 'DB:74:CE:58:E4:F9:D0:9E:E0:42:36:BE:6C:C5:C4:F6:6A:E7:74:7D:C0:21:42:7A:03:BC:2F:57:0C:8B:9B:90'.
- Set the preference value for 'app.normandy.api_url' to 'https://normandy.stage.mozaws.net/api/v1'.

[Steps to reproduce]:
1. Open the SHIELD Control Panel.
2. Create, approve and publish a new preference experiment recipe. 
3. Launch Firefox having the pre-requisites set and open:
   - browser console
   - about:telemetry (locate the Events tab and choose "dynamic" in top right corner dropdown)
   - about:config (in order to check the thetoolkit.telemetry.cachedClientID value)
4. Close Firefox then check the shield-preference-experiments.json document ffrom the profile directory. 
5. Reopen Firefox and manually modify the chosen preference to a new value (different than the default value or than the value set by the experiment).
6. Open about:telemetry (locate the Events tab and choose "dynamic" in top right corner dropdown) and check the current experiment.
7. Close Firefox then check the shield-preference-experiments.json document form the profile directory. 
8. Go to https://sql.telemetry.mozilla.org/queries/50541/source, modify the params for date and client_id with the ones matching your recipe, then execute the query.
9. After 1-2 days, look for your clientID and for your unenrollment experiment name in https://sql.telemetry.mozilla.org/queries/50541/source.
  
[Expected result]:
- [step 3] - Browser console will show that the recipe has been successfully fetched and executed
           - about:telemetry will display the details about the current event (the enrollment process)
- [step 4] - The value for the "expired" field is set to "false" for this experiment
- [step 6] - The unenrollment event is created in about:telemetry
- [step 7] - The value for the "expired" field is set to "true" for this experiment
- [step 9] - The unenrollment telemetry event for this particular clientID is created and displayed in sql after 1-2 days
           
[Actual result]:
- [step 4] - The value for the "expired" field is set to "false" for this experiment
- [step 6] - Both the enrollment and the unenrollment events are created (see the screenshot https://drive.google.com/file/d/1cj6I8YKJnNMOE--IdUC2SGbxAAud7Blr/view?usp=sharing)
- [step 7] - The value for the "expired" field is set to "true" for this experiment
- [step 9] - No unenrollment telemetry event for this particular clientID is created in sql for about 9 days

[Regression range]:
- This needs more investigation (extra details about this will be provided as soon as possible)

[Additional notes]
- Here are 3 examples of clientIDs on which were unenrollled the following preference experiments:
  - b27d5d55-740e-4855-996f-017e7b151f16 - here was unenrolled 'ExpServerUnEnrollExp' on 2018-04-02
  - dc74e82-a98a-4644-baf9-39c4def84854 - here was unenrolled 'searchingevents' on 2018-04-02
  - 6cdcfc24-c336-423c-8ed0-fc02bbdc54d7 - here was unenrolled 'ubuName2' on 2018-04-02
  They can be followed on sql by executing this query: 
   SELECT *
   FROM events
   WHERE event_category='normandy'
     AND submission_date_s3 > '20180101'AND 
     (
     client_id like '%b27d5d55-740e-4855-996f-017e7b151f16' or 
     client_id like '%fdc74e82-a98a-4644-baf9-39c4def84854' or 
     client_id like '%6cdcfc24-c336-423c-8ed0-fc02bbdc54d7' )
- The unenrollment telemetry events for the above mentioned clientIDs were displayed for the first time on 2018-04-12  
- Going to the saved-telemetry-pings folder from the profile's directory, there were no unsent items
- Didn't manage yet to check if this issue is also reproducible for macOS profiles
Do you still have access to those clients? If you do, can you go through the archived "main" pings in about:telemetry and find the one that contains the unenroll events? It'll have a document id which we can use to find the exact ping in the pipeline which can give us a starting point for server-side investigation.
Flags: needinfo?(iulia.cristescu)
(In reply to Chris H-C :chutten from comment #1)
> Do you still have access to those clients? If you do, can you go through the
> archived "main" pings in about:telemetry and find the one that contains the
> unenroll events? It'll have a document id which we can use to find the exact
> ping in the pipeline which can give us a starting point for server-side
> investigation.

I don't really understand what you actually demand. 
The problematic client IDs are listed in comment 0 (see the [Additional notes] section). I also mentioned there the name of the associated preference experiments which were unenrolled too late. You can check them going to https://sql.telemetry.mozilla.org/queries/50541/source. The enrollment and also the unenrollment processes are successfully listed  in about:telemetry for each preference experiment (see the [Actual result] section, on [step 6]). Are there other specific details I can provide for the above mentioned client IDs?
Flags: needinfo?(iulia.cristescu) → needinfo?(chutten)
I'm sorry I was unclear. I was hoping to confirm that the main ping containing the events you wished to see was present on the client machine and was properly formed.

On the server, I found the following main ping for client b27d5d55...:

Shutdown at 2018-04-02T18:00:00.0+02:00
submission_date_s3 of 20180411
Session length 349s
Events:
[
{timestamp=5566, category=normandy, method=enroll, object=preference_study, string_value=ExpServerUnEnrollExp, map_values={telemetry_process=dynamic, branch=Branch1, experimentType=exp}}, 
{timestamp=5581, category=normandy, method=enroll, object=preference_study, string_value=ExpServerEnrollExp, map_values={telemetry_process=dynamic, branch=Branch1, experimentType=exp}}, 
{timestamp=6534, category=normandy, method=enrollFailed, object=addon_study, string_value=optbadname, map_values={reason=resource://normandy/lib/Addons.jsm:98:16 Error, telemetry_process=dynamic}}, 
{timestamp=6534, category=normandy, method=enroll, object=preference_study, string_value=serName, map_values={telemetry_process=dynamic, branch=serBranch, experimentType=exp}}, 
{timestamp=6550, category=normandy, method=enroll, object=preference_study, string_value=searchingevents, map_values={telemetry_process=dynamic, branch=recbranch, experimentType=exp}}, 
{timestamp=141753, category=normandy, method=unenroll, object=preference_study, string_value=ExpServerUnEnrollExp, map_values={reason=user-preference-changed, telemetry_process=dynamic, didResetValue=false}}
]

This appears to show all the events you hoped to see? 

The trouble appears to be the first two lines. The main ping was successfully created on 2018-04-02, but was only received on the pipeline (submission_date_s3) on 2018-04-11.

I'm not sure what the source of delay might have been. From the other main pings, it appears as though that client wasn't used again until 2018-04-11, so what seems likely to me is that pingsender failed to send this main ping. Thus, it had to wait until the next time that particular Firefox profile was started up, which was nine days later.

Does this help explain the behaviour? Is this consistent with how the testing worked out?

Do you still have access to the problematic clients? (To one of b27d5d55..., fdc74e82..., 6cdcfc24...?)
If so, we can find out which "main" pings transmitted the unenroll events.

Let's make sure I know what's going on and I'm not mistaken.
The bug is that the three clients listed were unenrolled from the three preference experiments listed on 2018-04-02... but that the unenroll event was not received until 2018-04-11?
Flags: needinfo?(chutten) → needinfo?(iulia.cristescu)
( The query showing the main pings I'm referencing can be found here: https://sql.telemetry.mozilla.org/queries/52558/source )
Hoping that this is the info you're looking for, here you have the content of the "saved-telemetry-pings" folder, located in the b27d5d55-740e-4855-996f-017e7b151f16 profile directory. 
I will provide the corresponding files for the other two client IDs tomorrow, when I will have access to their profile directories.
Flags: needinfo?(iulia.cristescu)
Attached image about:telemetry.png
It also seems that some experiments and studies are missing from server ping that you listed in comment 3 for the b27d5d55-740e-4855-996f-017e7b151f16 client ID.
(In reply to Chris H-C :chutten from comment #3)
> 
> I'm not sure what the source of delay might have been. From the other main
> pings, it appears as though that client wasn't used again until 2018-04-11,
> so what seems likely to me is that pingsender failed to send this main ping.
> Thus, it had to wait until the next time that particular Firefox profile was
> started up, which was nine days later.
> 
> Does this help explain the behaviour? Is this consistent with how the
> testing worked out?
> 

Yes, that's how the testing took place: those 3 clients were created on 2018-04-02 and then reopened on 2018-04-11. In this period we re-executed this query 
SELECT *
   FROM events
   WHERE event_category='normandy'
     AND submission_date_s3 > '20180101'AND 
     (
     client_id like '%b27d5d55-740e-4855-996f-017e7b151f16' or 
     client_id like '%fdc74e82-a98a-4644-baf9-39c4def84854' or 
     client_id like '%6cdcfc24-c336-423c-8ed0-fc02bbdc54d7' )

on sql, in order to verify if the server events were created or not, but without reusing those profiles.
(In reply to Iulia Cristescu, QA [:JuliaC] from comment #6)
> Created attachment 8968239 [details]
> about:telemetry.png
> 
> It also seems that some experiments and studies are missing from server ping
> that you listed in comment 3 for the b27d5d55-740e-4855-996f-017e7b151f16
> client ID.

The ping in your screenshot was assembled on 2018-04-11, the ping I digested was assembled on 2018-04-02. The query I listed has those events in it, you'll find it in the row that has a profileSubsessionCounter of 2.

So it seems as though all the events you were expecting to see are present? This bug can be resolved?
(In reply to Chris H-C :chutten from comment #8)
> (In reply to Iulia Cristescu, QA [:JuliaC] from comment #6)

> The ping in your screenshot was assembled on 2018-04-11, the ping I digested
> was assembled on 2018-04-02. The query I listed has those events in it,
> you'll find it in the row that has a profileSubsessionCounter of 2.
> 
Yes, here was my fault, I uploaded the wrong screenshot. Here is the ping assembled on 04-02-2018 https://drive.google.com/file/d/1vdQmlr0uQqzKXS6Pn9kUBB88bJtj24je/view?usp=sharing and yes, all the events correspond with the one listed by you. 

> So it seems as though all the events you were expecting to see are present?
> This bug can be resolved?

The problem stated by this bug is still up: the correspondig telemetry events were not available in sql till 2018-04-12. And, as you mentioned in comment 3, this is because pingsender failed to send this main ping. What should be the workaround here? Is there a minimum time for which the corresponding profile should be in use in order to have a successfully sent ping?
Flags: needinfo?(chutten)
pingsender is a best-effort attempt to try and send pings at Firefox shutdown. It sometimes fails to work (not often, but it doesn't have as rich of a networking stack as Firefox does to handle things like proxies. See [1] section "Submission Delay" if you'd like) in which case the old ping-sending behaviour takes over: we have to wait for Firefox to start up again.

If you want to make absolutely certain that a ping has been sent you will need to check the profile's saved-telemetry-pings directory to ensure it hasn't been missed. If it has, then starting Firefox again should send it within about 2min of startup. (It should only take about 60s after profile init, but delayed init code we put in for Quantum may make it take a little longer)

Does this make sense?

[1]: http://reports.telemetry.mozilla.org/post/projects%2Fmain_ping_delays_pingsender_beta.kp
Flags: needinfo?(chutten)
Thank you for all these useful info and documentation.  I will take them into account in the next (Beta post-backout) testing session for Normandy Telemetry feature, to ensure that this was the cause of the current issue.
Seems like we've cleared up this misunderstanding and can close the bug.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: