# Dotnet tracking

## Background

The dotnet SDK has a "dotnet" command. It sends some usage data to Microsoft (via Azure application insights):

```
$ dotnet help

Welcome to .NET Core 3.1!
---------------------
SDK Version: 3.1.102

Telemetry
---------
The .NET Core tools collect usage data in order to help us improve your experience. The data is anonymous. It is collected by Microsoft and shared with the community. You can opt-out of telemetry by setting the DOTNET_CLI_TELEMETRY_OPTOUT environment variable to '1' or 'true' using your favorite shell.

Read more about .NET Core CLI Tools telemetry: https://aka.ms/dotnet-cli-telemetry
```

The linked page says: "The data is anonymous."

## The data

Let's look at the information that is collected in some detail:

```json
{
  "name": "Microsoft.ApplicationInsights.74cc1c9e3e6e4d05b3fcdde9101d0254.Event",
  "time": "2020-08-01T16:32:22.3019376Z",
  "iKey": "74cc1c9e-3e6e-4d05-b3fc-dde9101d0254",
  "tags": {
    "ai.session.id": "6a3bf75b-6cf4-455d-82ea-fcc49a9bdb55",
    "ai.device.osVersion": "nixos",
    "ai.internal.sdkVersion": "2.0.0.30671"
  },
  "data": {
    "baseType": "EventData",
    "baseData": {
      "ver": 2,
      "name": "dotnet/cli/toplevelparser/command",
      "properties": {
        "OS Version": "20.03.2157.db31e48c5c8",
        "OS Platform": "Linux",
        "Output Redirected": "False",
        "Runtime Id": "nixos.20.03.2157.db31e48c5c8-x64",
        "Product Version": "3.1.102",
        "Docker Container": "False",
        "Current Path Hash": "67acfe0ddd44867e1e5da5ddaf25a5b90e928f523cecf614e201c683b7533cf6",
        "Machine ID": "",
        "Kernel Version": "Linux 5.4.46 #1-NixOS SMP Wed Jun 10 18:24:58 UTC 2020",
        "Libc Release": "stable",
        "Libc Version": "2.30",
        "verb": "37a6760fa43caf5b1ea02f22251b1456c39e060a4918ba3e963dbde336f16148",
        "event id": "6bf8c947-7790-4f8e-b6e3-f3d4102639fa"
      }
    }
  }
}
```

This is from running `dotnet help` on my machine, then watching the data sent.
(Luckily the data is sent over TLS and certificates are checked, so I intercepted it by issuing a certificate for the `dc.services.visualstudio.com` hostname under my own CA.)

Three pieces of information are particularly revealing here:

- "Current Path Hash"
- "Machine ID"
- "OS Version"

These are present in the JSON blob and match the definitions of the fields at: https://docs.microsoft.com/en-gb/dotnet/core/tools/telemetry#data-points

### Current Path Hash

Current Path Hash is a SHA256 hash of the current working directory, represented in hexadecimal. We can get the same value by running:

```
$ echo -n "$PWD" | sha256sum
67acfe0ddd44867e1e5da5ddaf25a5b90e928f523cecf614e201c683b7533cf6  -
```

Note that there is no salt, so the same path always produces the same hash. Once one value has been reversed, every record in the database carrying that hash is known to refer to the same path, which makes linking records easy. Effectively this value, although hashed, should be treated as identical to the original data.

In this case I ran it in my home directory (`/home/dgl`). Running `dotnet help` in the default directory that a terminal opens in seems like quite a common operation. While `dgl` may not be a totally unique username, the path likely qualifies as Personally Identifiable Information (PII).

Because the data also includes the platform, the search space is reduced further. For example, on Windows a path like `C:\Users\USERNAME` would be more common and could be searched for.

### Machine ID

Machine ID is the MAC address of the machine. Again it is run through SHA256, but without a salt.

To confirm that the Machine ID is the MAC address on my machine I ran:

```sh
$ ip link
3: eno1: mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether xx:xx:xx:xx:xx:xx brd ff:ff:ff:ff:ff:ff
```

This gives me a MAC address of xx:xx:xx:xx:xx:xx.
I can then uppercase that and generate a SHA256 sum:

```sh
$ echo -n xx:xx:xx:xx:xx:xx | tr 'a-z' 'A-Z' | sha256sum
[removed in this version of report]  -
```

This gives the same output as "Machine ID" in the JSON blob above.

The first issue is that this gives a way to link my run of `dotnet help` in my home directory with any other run of the `dotnet` command on this machine. Because the home directory path was enough to identify me, the Machine ID sent alongside it becomes PII by association.

The second issue is that a MAC address is an identifier in its own right. It is hashed, but without a salt, so it is reversible in the same way as the path.

Naively the number of possibilities is 2^48 (i.e. the 48 bits in a MAC address), however not all address prefixes are assigned. The list of assigned prefixes can be downloaded from: http://standards-oui.ieee.org/oui.txt

A simple count of lines that look like `000000 (base 16)` gives the number of currently assigned prefixes:

```sh
$ egrep -ci '^[a-f0-9]{6}' oui.txt
28319
```

This means the number of MAC addresses to try is actually 28319 * 2^24. (Note that some addresses used by VMs and other adaptors may not be taken from the assigned ranges. With a little more research into the ranges actually in use, the point still stands: the search space is considerably smaller than 2^48.)

On my machine I can do 4812580 hashes of 16 byte blocks per second (based on `openssl speed`):

```
$ openssl speed sha256
Doing sha256 for 3s on 16 size blocks: 14389617 sha256's in 2.99s
```

i.e. 14389617 / 2.99 ≈ 4812580 hashes per second, and (28319 * 2^24) / 4812580 ≈ 98723 seconds.

So it takes around 27 hours to reverse a single MAC address by brute force. However this is using a single core on a modern but fairly low power processor (Core i5-9400T).
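
The estimate above can be reproduced with shell integer arithmetic. This is a minimal sketch rather than part of the original measurement: the function name `estimate_seconds` is hypothetical, and the inputs are the prefix count and single-core hash rate quoted above.

```sh
#!/bin/sh
# Hypothetical helper: seconds needed to hash every MAC address
# under the assigned OUI prefixes.
#   $1: number of assigned OUI prefixes (28319 in the count above)
#   $2: SHA256 hashes per second (4812580 from the openssl speed run above)
estimate_seconds() {
    # Each 24-bit prefix leaves 2^24 possible device suffixes.
    echo $(( $1 * (1 << 24) / $2 ))
}

secs=$(estimate_seconds 28319 4812580)
# Integer division, so the results are rounded down.
echo "$secs seconds, about $(( secs / 3600 )) hours"
```

Passing a different hash rate as the second argument gives the corresponding estimate for other hardware.
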
There are claims that a GPU can compute 200 million hashes per second. I haven't tried this, but I suspect that either GPUs or parallelising via the cloud would make it possible to do this in under an hour, potentially making it feasible to reverse a significant amount of data in a few days.

### OS version

In my case the OS version includes the generation hash of NixOS. This is actually the hash of my personal configuration, and the same would be true for any NixOS user. It is a potentially unexpected source of PII, but PII nonetheless.

## Conclusion

The claim that telemetry is anonymous is untrue. Because the chosen hash function is reversible for the inputs used, it is possible to reverse the data to actual PII. The data cannot even be considered pseudonymised, as it is possible to reverse it without any additional data.

Once some data has been reversed it is possible to use this to link data together. For example, a user reporting an issue on an issue tracker may include output that shows the working directory; this could then be hashed and simply looked up in the database. With a little brute force, the matching record's Machine ID can then be reversed to the MAC address of the machine the user ran the command on.

This means we have to trust Microsoft with this data, even though it has now been shown that the data can be used for more than Microsoft states it is used for on https://docs.microsoft.com/en-gb/dotnet/core/tools/telemetry

Recommendations:

- Remove the telemetry, which avoids any potential GDPR issues;
- Delete the historical data (I am not a lawyer, but I suspect this has GDPR implications);
- If telemetry is kept, remove "Current Path Hash" and "Machine ID", and consider sanitising the OS version to remove potential unique identifiers.

This is a submission for the dotnet core bug bounty program.
I will aim to publicly disclose this once the issue has been resolved, or within a reasonable timeframe if no other agreement is reached (I understand this depends on releases of dotnet core, so I would consider that to be as much as 6 months in this case).

I am not a lawyer, so this wording may be too vague. Although I understand there are GDPR implications to these findings, I am not treating this disclosure as exercising my rights under the GDPR at this time.