Personal City-Level Location Prediction Based on Behavior via Machine Learning

Munich

Before I started my Erasmus I thought it could be fun to collect some data about myself while traveling so that I can visualize and analyze them later.
Here I will try to do exactly that without diving into any theoretical explanations of why I did what I did, but I'll add some interpretations of the results.

Goals

This post has two goals:

  1. Explore and visualize my personal data from my Erasmus time (and a bit after)
  2. Predict the city I am in based on my behavior (number of steps on a day, money spent, and number of photos/videos taken)

For goal 1, I will use data I collected myself from my phone and smartwatch, as well as data that Google collected about me.
For goal 2, I will use only the data that I collected myself.

I'll consider goal 2 a success if I can feed in the following day, which wasn't part of the training data, and get the correct prediction.
On December 19, 2023, I was in Munich, took 2575 steps, spent no money, and took no photos/videos.


example = {
    'steps': 2575,
    'money_spent_offline': 0,
    'money_spent_online': 0,
    'money_spent_total': 0,
    'photos_videos': 0
}

I don't aim for the best possible results or to explain any theory. I just want a rough idea of how well this can work.

Exploration and Visualization

Google Data

Let's start by exploring some of the data that Google saved about me. In total, Google saved 2164 place visits.

All map data used for images © OpenStreetMap contributors

These are the 5 places I visited most often, with the number of visits, followed by a few of the places I visited only once:


El Viajero Sedentario         163
University of Seville          66
Torneo (San Lorenzo)           31
Quiero Dulce                   21
REWE                           20

...                          ... 

Giselastraße                    1
German post office              1
Gazzo                           1
Garching-Forschungszentrum      1
La Sagrada Familia              1

Google only saves the names of visited places, but doesn't categorize them. I tried categorizing them by checking what the place name contains. This is very inaccurate, as lots of places won't have their category in their name. E.g., cafés often won't have "café" in their name.
At least my home addresses, university, and airport visits should be accurate. The airport visits don't all correspond to flights, but also include picking up or dropping off friends.
Here's the result:


Other            1258
Seville Home      483
Munich Home        74
University         73
Market             51
Cafe               42
Airport            40
Berlin Home        38
Train Station      33
Bar                32
Restaurant         23
Beach               9
Museum              8
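The name-based categorization can be sketched roughly like this. The keyword lists and the function name `categorize` are hypothetical and only illustrate the idea; they are not the rules that produced the counts above.

```python
# Hypothetical keyword lists; the real rules behind the table are not shown in this post.
CATEGORY_KEYWORDS = {
    "Airport": ["airport", "aeropuerto", "flughafen"],
    "Train Station": ["station", "estación", "bahnhof"],
    "University": ["university", "universidad", "universität"],
    "Cafe": ["café", "cafe", "cafetería"],
    "Market": ["market", "mercado", "rewe"],
}


def categorize(place_name: str) -> str:
    """Return the first category whose keyword occurs in the name, else 'Other'."""
    name = place_name.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in name for keyword in keywords):
            return category
    return "Other"


for place in ["University of Seville", "REWE", "El Viajero Sedentario"]:
    print(place, "->", categorize(place))
```

The last example shows the weakness mentioned above: a café whose name contains no café-like word ends up in "Other".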

Google can also save activities it recognizes like flying, walking, bus rides, etc.
This is what the first five entries of the activity data set look like (some coordinates censored):


     activityType          startTimestamp                    endTimestamp                      startLatitude  startLongitude  endLatitude  endLongitude
473  IN_PASSENGER_VEHICLE  2022-09-01 02:20:28.243000+00:00  2022-09-01 03:04:01+00:00         ?????          ?????           52.365605    13.508949
474  FLYING                2022-09-01 04:23:15.995000+00:00  2022-09-01 06:10:42.876000+00:00  52.365978      13.505437       50.044165    8.562142
475  FLYING                2022-09-01 07:18:53.480000+00:00  2022-09-01 10:23:56.245000+00:00  50.044132      8.562804        37.423092    -5.899401
476  IN_PASSENGER_VEHICLE  2022-09-01 10:57:11.553000+00:00  2022-09-01 11:16:23.291000+00:00  37.423862      -5.900755       37.381626    -6.000961
477  WALKING               2022-09-01 11:37:37.062000+00:00  2022-09-01 12:08:41.935000+00:00  37.382230      -6.001706       ?????        ?????

So now it is possible to plot the activities on a map.
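To draw connections on a map, each activity first has to be turned into a pair of start and end points. Here is a minimal sketch of that step, using three of the rows above. The helper name `segments` and the list-of-dicts layout are assumptions for illustration, not the code behind the maps in this post.

```python
# Three rows copied from the activity table above; keys follow its column names.
activities = [
    {"activityType": "FLYING",
     "startLatitude": 52.365978, "startLongitude": 13.505437,
     "endLatitude": 50.044165, "endLongitude": 8.562142},
    {"activityType": "FLYING",
     "startLatitude": 50.044132, "startLongitude": 8.562804,
     "endLatitude": 37.423092, "endLongitude": -5.899401},
    {"activityType": "IN_PASSENGER_VEHICLE",
     "startLatitude": 37.423862, "startLongitude": -5.900755,
     "endLatitude": 37.381626, "endLongitude": -6.000961},
]


def segments(records, activity_type):
    """Return ((lat, lon), (lat, lon)) start/end pairs for one activity type."""
    return [
        ((r["startLatitude"], r["startLongitude"]),
         (r["endLatitude"], r["endLongitude"]))
        for r in records
        if r["activityType"] == activity_type
    ]


flight_segments = segments(activities, "FLYING")
for start, end in flight_segments:
    print(start, "->", end)
```

Each pair can then be handed to whatever mapping library is used to draw a line between the two points.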

Flights

Here are all my flight connections:

Train Rides

All train rides:

Flights and Train Rides

Flights and train rides combined:

Bus and Train Rides

A few of the bus rides were mixed up with train rides, so I'll show train rides and bus rides combined:

Walking

For walking, here are heatmaps of where I walked most in Seville, first zoomed out and then closer:

The heatmap of the place visits Google saved looks rather similar to the walking heatmap, as I mostly walked everywhere anyway:

My own data

Originally, the data I collected about myself contained roughly one row every half hour. Each row recorded my coordinates, the location name on a city level, the country name, the number of Wi-Fi networks in reach, and a timestamp.
I aggregated this to one row per day for simplicity. This meant assigning a single city to each day, so some places where I spent little time don't show up in the data.
Then I added more columns: I scraped my bank account for my payments history, processed my smartwatch data for heart rate, counted my photos and videos, and used a weather service for temperature and condition data.
As my phone broke and I didn't have one for a few days, I also filled in some missing data manually or, as in the case of steps for example, filled in the mean of the variable. This might introduce bias, but as there were only a few days missing, I think it's acceptable.
In the end I also added some derived columns, e.g., the day of the week, season, or a simplified weather condition.
This is, in short, how I arrived at the dataset I will use.
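The aggregation from half-hourly samples to one row per day could look roughly like the sketch below. Picking the most frequent city of the day is an assumption (the post does not say which rule was used), and the sample rows are made up.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical half-hourly samples: (ISO timestamp, city, Wi-Fi networks in reach).
samples = [
    ("2022-09-01T05:00", "Berlin", 8),
    ("2022-09-01T12:00", "Seville", 20),
    ("2022-09-01T12:30", "Seville", 26),
    ("2022-09-02T09:00", "Seville", 15),
]


def aggregate_daily(rows):
    """Collapse samples to one record per day: most frequent city, mean Wi-Fi count."""
    by_day = defaultdict(list)
    for timestamp, city, wifi_count in rows:
        by_day[timestamp[:10]].append((city, wifi_count))  # first 10 chars = date

    daily = {}
    for day, entries in sorted(by_day.items()):
        cities = Counter(city for city, _ in entries)
        daily[day] = {
            "city": cities.most_common(1)[0][0],
            "wifi_count_mean": mean(count for _, count in entries),
        }
    return daily


daily = aggregate_daily(samples)
for day, record in daily.items():
    print(day, record)
```

In this toy example Berlin disappears from September 1 because Seville has more samples that day, which is the effect described above for places with short stays.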

The final dataset contains 416 rows and 30 columns. I think the most interesting columns here are
date, city, max_temp, apparent_max_temp, weather_condition, steps, money_spent_offline, money_spent_online, wifi_count_mean, heart_rate_combined, photos_videos

All temperature values are in °C, money spent in €, and heart rate in bpm.

These are the first 10 rows (days) of the dataset:


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2022-08-25 Thursday Weekday August Summer Berlin Germany 27.3 28.7 -1.4 Moderate Light drizzle Rain 10399 Moderate 7.20 0.00 7.20 9.25 NaN NaN NaN NaN NaN NaN NaN 0 NaN 52.52437 13.41053
2022-08-26 Friday Weekday August Summer Berlin Germany 26.9 29.2 -2.3 Humid Heavy rain Rain 3991 Low 0.00 0.00 0.00 11.57 NaN NaN NaN NaN NaN NaN NaN 1 Moderate 52.52437 13.41053
2022-08-27 Saturday Weekend August Summer Berlin Germany 23.0 25.1 -2.1 Humid Moderate rain Rain 2423 Low 0.00 0.00 0.00 2.79 NaN NaN NaN NaN NaN NaN NaN 1 Moderate 52.52437 13.41053
2022-08-28 Sunday Weekend August Summer Berlin Germany 21.9 20.8 1.1 Moderate Light drizzle Rain 2143 Low 0.00 0.00 0.00 9.46 NaN NaN NaN NaN NaN NaN NaN 5 Moderate 52.52437 13.41053
2022-08-29 Monday Weekday August Summer Berlin Germany 21.7 20.9 0.8 Moderate Light drizzle Rain 8250 Moderate 9.95 0.00 9.95 12.47 NaN NaN NaN NaN NaN NaN NaN 9 Moderate 52.52437 13.41053
2022-08-30 Tuesday Weekday August Summer Berlin Germany 20.9 20.4 0.5 Moderate Light drizzle Rain 6856 Moderate 15.95 0.00 15.95 7.83 NaN NaN NaN NaN NaN NaN NaN 0 NaN 52.52437 13.41053
2022-08-31 Wednesday Weekday August Summer Berlin Germany 21.3 19.7 1.6 Moderate Light drizzle Rain 10668 Moderate 0.00 0.00 0.00 21.10 NaN NaN NaN NaN NaN NaN NaN 0 NaN 52.52437 13.41053
2022-09-01 Thursday Weekday September Autumn Seville Spain 34.7 33.4 1.3 Moderate Mostly cloudy Cloudy 28784 High 76.96 0.00 76.96 21.83 NaN NaN NaN NaN NaN NaN NaN 30 High 37.38283 -5.97317
2022-09-02 Friday Weekday September Autumn Seville Spain 33.6 31.9 1.7 Moderate Partly cloudy Cloudy 13744 High 34.28 21.00 55.28 15.05 NaN NaN NaN NaN NaN NaN NaN 14 Moderate 37.38283 -5.97317
2022-09-03 Saturday Weekend September Autumn Seville Spain 32.1 31.0 1.1 Moderate Mainly clear Clear 35051 High 42.00 71.93 113.93 12.60 NaN NaN NaN NaN NaN NaN NaN 58 High 37.38283 -5.97317
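The date-derived columns (weekday, weekend, month, season) can be reproduced from the date alone. The sketch below assumes meteorological seasons for the northern hemisphere, which matches the rows shown above (September 1 already counts as autumn). The function name `date_features` is made up for illustration.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]
MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]

# Assumption: meteorological seasons (December to February is winter, and so on).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Autumn", 10: "Autumn", 11: "Autumn",
}


def date_features(day: date) -> dict:
    """Derive the calendar columns of the dataset from a single date."""
    return {
        "weekday": WEEKDAYS[day.weekday()],
        "weekend": "Weekend" if day.weekday() >= 5 else "Weekday",
        "month": MONTHS[day.month - 1],
        "season": SEASON_BY_MONTH[day.month],
    }


print(date_features(date(2022, 8, 25)))
print(date_features(date(2022, 9, 3)))
```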

I visited at least the following 28 cities. The count is the number of days I spent there.
There are cities missing, like Ronda for example, because I didn't spend a lot of time there. I didn't bother to fix this.


Seville                       223
Munich                         65
Berlin                         48
Cádiz                          21
Istanbul                        7
Milano                          6
Lisbon                          4
Madrid                          4
Valencia                        4
Barcelona                       4
Las Palmas de Gran Canaria      3
Vinaròs                         3
Merzouga                        3
Benidorm                        3
Cartagena                       2
Córdoba                         2
Alhama de Aragón                2
Fès                             2
Dénia                           1
Zaragoza                        1
Castelló de la Plana            1
Lagoa                           1
Granada                         1
Málaga                          1
Zuheros                         1
Barbate                         1
Jerez de la Frontera            1
Santiponce                      1

And these are the 6 countries I visited (on 3 continents, two of which I visited for the first time):


Spain       280
Germany     113
Turkey        7
Italy         6
Portugal      5
Morocco       5

The following are some summary statistics like top value per variable, mean, or min and max. I censored the money_spent variables as I will also do later on.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
count 416 416 416 416 416 416 416.000000 416.000000 416.000000 416 416 416 416.000000 416 416.000000 416.000000 416.000000 416.000000 199.000000 168.000000 359.000000 199.000000 166.000000 358.000000 358.000000 416.000000 342 416.000000 416.00000
unique 7 2 12 4 28 6 NaN NaN NaN 3 12 4 NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN
top Thursday Weekday September Autumn Seville Spain NaN NaN NaN Moderate Mainly clear Clear NaN Moderate NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN Moderate NaN NaN
freq 60 297 60 135 223 280 NaN NaN NaN 273 87 149 NaN 188 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 269 NaN NaN
mean NaN NaN NaN NaN NaN NaN 23.865144 22.922356 0.942788 NaN NaN NaN 11577.788462 NaN ????? ????? ????? 12.007359 11392.187500 10999.627329 11247.524547 71.333333 71.798742 71.540037 0.444134 17.302885 NaN 41.001306 0.05572
std NaN NaN NaN NaN NaN NaN 7.779006 9.021076 1.961471 NaN NaN NaN 7793.126558 NaN ????? ????? ????? 4.198189 7440.075266 7734.197996 7629.722817 4.250173 4.732582 4.519293 0.497565 37.645968 NaN 5.841556 9.06130
min NaN NaN NaN NaN NaN NaN -1.400000 -5.700000 -6.000000 NaN NaN NaN 107.000000 NaN 0.000000 0.000000 0.000000 0.000000 384.000000 74.000000 74.000000 62.000000 63.000000 62.000000 0.000000 0.000000 NaN 28.099730 -15.41343
25% NaN NaN NaN NaN NaN NaN 19.100000 17.650000 -0.100000 NaN NaN NaN 5637.750000 NaN ????? ????? ????? 8.937500 6163.000000 5181.500000 5597.500000 68.000000 68.000000 68.000000 0.000000 1.000000 NaN 37.382830 -5.97317
50% NaN NaN NaN NaN NaN NaN 24.400000 24.200000 0.900000 NaN NaN NaN 9872.000000 NaN ????? ????? ????? 11.265000 9943.000000 9039.000000 9440.000000 71.000000 71.798742 71.000000 0.000000 4.000000 NaN 37.382830 -5.97317
75% NaN NaN NaN NaN NaN NaN 29.600000 29.400000 2.100000 NaN NaN NaN 16053.500000 NaN ????? ????? ????? 14.622500 15540.500000 14017.000000 15096.500000 73.500000 74.750000 74.000000 1.000000 14.000000 NaN 48.137430 11.57549
max NaN NaN NaN NaN NaN NaN 43.300000 43.300000 6.300000 NaN NaN NaN 39452.000000 NaN ????? ????? ????? 34.750000 37912.000000 35623.000000 37912.000000 83.000000 88.000000 88.000000 1.000000 300.000000 NaN 52.524370 28.94966

Visualizations

Steps

Starting with the steps variable, here is a histogram of it:

Histogram of Step Count

and the daily step count over time:

mean: 11577, median: 9872

or the same information as a heatmap:

The day I took the most steps was the day of the Carnival of Cádiz. We had also been in a club the night before, so the hours after midnight already contributed a lot:


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-02-18 Saturday Weekend February Winter Cádiz Spain 19.1 15.1 4.0 Dry Light drizzle Rain 39452 High 16.14 0.0 16.14 13.15 37912.0 NaN 37912.0 78.0 NaN 78.0 0.0 24 Moderate 36.52672 -6.2891

Next we can check whether the step count differs between days of the week. To me, it looks like there is a noticeable difference between Saturday and the rest of the week.
A statistical test would be needed to find out how likely a difference like this is to appear by chance, but I will leave this out because I have already spent way too much time on this whole thing.
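For reference, one simple option for such a test is a permutation test on the difference in means. The sketch below is not something I ran on the real data. The step counts are invented, and the function name and parameters are assumptions.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        shuffled_a = pooled[:len(group_a)]
        shuffled_b = pooled[len(group_a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical step counts, NOT taken from the dataset.
saturdays = [18000, 22000, 15000, 25000, 21000, 19000]
other_days = [9000, 11000, 8000, 12000, 10000, 7000, 9500, 10500]

p_value = permutation_p_value(saturdays, other_days)
print(f"p-value: {p_value:.4f}")
```

A small p-value would mean that a difference this large rarely shows up when the day labels are shuffled at random.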

This is the comparison between the weekend and not weekend:

And the following is the comparison between different weather conditions. The following two graphs show the same information, but the second one uses simplified weather conditions, which group different intensities of the same condition (e.g., light rain and heavy rain are both rain).
As I would expect, the step count is higher when the weather is better.

We can also compare the step count between different cities. The following graph shows the mean step count per city.
The highest mean step counts are in cities where I only stayed for a short time, because I walked around a lot to see as much as possible.

Temperature

Changing to another variable, here are the maximum and apparent maximum temperature per day:

The hottest day I had was in Seville when the temperature rose to 43.3°C. I'm sure there were hotter days even, but I also escaped to Cádiz sometimes.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-06-26 Monday Weekday June Summer Seville Spain 43.3 43.3 0.0 Moderate Clear Clear 7300 Moderate 187.41 164.62 352.03 15.72 NaN 8419.0 8419.0 NaN 67.0 67.0 1.0 4 Moderate 37.38283 -5.97317

I had never experienced temperatures this high before, but I was surprised that it was less of a problem than expected. I think the reason is that the humidity was not that high.
This means the apparent temperature was similar to the actual temperature. We can plot the difference between the actual and the apparent temperature (temp_difference = max_temp - apparent_max_temp). In this next graph, if the difference is positive, the apparent temperature is lower than the actual temperature.
From this I would conclude that the humidity is lower. If the difference is negative, I assume that the humidity is higher.
I'm not sure if this is completely correct under all circumstances (probably not, e.g., if there's a lot of wind), but I use it as an approximation for humidity.
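As a sketch, the difference and a rough humidity label can be computed like this. The difference itself follows the dataset columns (max_temp minus apparent_max_temp). The cut-off of 2°C is a hypothetical value that happens to agree with the rows shown in this post, not necessarily the threshold behind the humidity column.

```python
def temp_difference(max_temp: float, apparent_max_temp: float) -> float:
    """Actual minus apparent maximum temperature, rounded like the dataset."""
    return round(max_temp - apparent_max_temp, 1)


def humidity_label(difference: float, threshold: float = 2.0) -> str:
    """Map the temperature difference to a coarse humidity label.

    The threshold is an assumption for illustration.
    """
    if difference >= threshold:
        return "Dry"
    if difference <= -threshold:
        return "Humid"
    return "Moderate"


# (max_temp, apparent_max_temp) pairs taken from rows shown in this post.
for max_temp, apparent in [(27.3, 28.7), (10.8, 4.5), (34.0, 40.0)]:
    diff = temp_difference(max_temp, apparent)
    print(max_temp, apparent, diff, humidity_label(diff))
```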

With this logic, the driest day was on January 13, 2023 in Munich with a difference of 6.3°C:


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-01-13 Friday Weekday January Winter Munich Germany 10.8 4.5 6.3 Dry Moderate drizzle Rain 13961 High 44.61 98.07 142.68 15.38 14159.0 NaN 14159.0 70.0 NaN 70.0 0.0 0 NaN 48.13743 11.57549

And the most humid day was on July 12, 2023 in Cartagena with a difference of -6.0°C. It was so humid there that you could feel the humidity in the air and on your skin. Sweating didn't have a cooling effect anymore. It also felt hotter than, for example, in Seville, although the actual temperature in Cartagena was way lower.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-07-12 Wednesday Weekday July Summer Cartagena Spain 34.0 40.0 -6.0 Humid Partly cloudy Cloudy 16871 High 68.48 8.0 76.48 10.63 NaN 17496.0 17496.0 NaN 70.0 70.0 1.0 16 Moderate 37.60512 -0.98623

Weather

Next we can look at the (simplified) weather conditions I observed. The most common weather condition was clear, closely followed by rain. The few snowy days were probably mostly from the time I spent in Munich:

Looking only at the weather while I was in Seville, it was mostly clear, but there were also lots of rainy days. The longest period of consecutive days with rain was 7 days, starting on 2022-12-11.
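Finding the longest run of consecutive rainy days comes down to walking through the sorted dates and counting how long each streak lasts. Below is a minimal sketch with made-up dates. The function name `longest_streak` is an assumption and the dates are not the real weather log.

```python
from datetime import date, timedelta


def longest_streak(days_with_condition):
    """Return (start_date, length) of the longest run of consecutive dates."""
    best_start, best_length = None, 0
    run_start, run_length, previous = None, 0, None
    for day in sorted(set(days_with_condition)):
        if previous is not None and day - previous == timedelta(days=1):
            run_length += 1          # streak continues
        else:
            run_start, run_length = day, 1  # streak restarts
        if run_length > best_length:
            best_start, best_length = run_start, run_length
        previous = day
    return best_start, best_length


# Hypothetical rainy dates, NOT the real data.
rainy = [date(2022, 12, 1), date(2022, 12, 2),
         date(2022, 12, 5), date(2022, 12, 6), date(2022, 12, 7)]

start, length = longest_streak(rainy)
print(start, length)
```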

In Munich, it was mostly rainy/snowy, because I was there mostly in winter:

From the days I spent in Cádiz there were also quite a few rainy days:

Photos/Videos

Next we can look at the number of photos/videos taken per day, shown in the following graph.

mean: 17.3, median: 4

The day I took the most photos/videos was November 20, 2022 in Córdoba. I took 300 photos/videos that day.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2022-11-20 Sunday Weekend November Autumn Córdoba Spain 16.4 15.5 0.9 Moderate Mostly cloudy Cloudy 23850 High 131.1 4.99 136.09 11.8 26638.0 NaN 26638.0 80.0 NaN 80.0 0.0 300 High 37.89155 -4.77275

I took at least one photo/video on 342 out of 416 days, which is around 82.21% of all days.
In total, I took 7198 photos/videos, an average of 17.3 per day. The median, however, is only 4.
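The numbers above are quick to re-derive, and a small toy sample shows why the mean sits so far above the median when a few days have very many photos. The totals are taken from the text. The toy sample is invented.

```python
from statistics import mean, median

days_total = 416
days_with_media = 342
media_total = 7198

share_with_media = days_with_media / days_total * 100
mean_per_day = media_total / days_total
print(f"{share_with_media:.2f}% of days, {mean_per_day:.1f} per day on average")

# Hypothetical skewed sample: one extreme day pulls the mean up, not the median.
sample = [0, 1, 2, 4, 5, 9, 300]
print("mean:", round(mean(sample), 1), "median:", median(sample))
```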

There is also an observable difference between different weather conditions again. The following graph shows the mean number of photos/videos taken per day for different (simplified) weather conditions.
I again won't do a statistical test.

Wi-Fi

Next up is the mean Wi-Fi count per day, which is plotted in the following graph. It is the mean number of Wi-Fi networks in reach across all samples my phone recorded on that day.
I imagined that this could be an indicator of how civilized my location is.

mean: 12, median: 11.3

The day with the most Wi-Fi networks detected was April 7, 2023 in Barcelona with a mean of 34.75 Wi-Fi networks detected.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-04-07 Friday Weekday April Spring Barcelona Spain 18.9 17.0 1.9 Moderate Mainly clear Clear 26486 High 34.06 0.0 34.06 34.75 26786.0 NaN 26786.0 72.0 NaN 72.0 0.0 74 High 41.38879 2.15899

Heart Rate

In October 2022 I also started measuring my heart rate. In May 2023 I switched to a different device.

mean: 71.5, median: 71

One day especially sticks out in the plot: June 3, 2023. On this day I had a mean heart rate of 88 bpm. It was the day I traveled from Seville to Munich to surprise my girlfriend on her birthday.
One could say that my heart rate was higher because I traveled on that day and because I was very active, but I had lots of other days like that where my heart rate didn't stick out like this.
I think it's super cute to see that my heart rate was noticeably higher because I was so excited.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-06-03 Saturday Weekend June Summer Munich Germany 21.0 19.7 1.3 Moderate Mainly clear Clear 16038 High 93.50 0.0 93.50 19.25 NaN 13295.0 13295.0 NaN 88.0 88.0 1.0 4 Moderate 48.13743 11.57549

The most relaxing day was on February 25, 2023 in Seville with a mean heart rate of 62 bpm.
I also happen to know what I did that day: I went to my favorite café, El Viajero Sedentario, for two and a half hours in the morning and afterward I relaxed on the rooftop enjoying the weather that was slowly getting warmer. Although there were a few rain drops on that day, there was also a long time of sun I could enjoy.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-02-25 Saturday Weekend February Winter Seville Spain 15.2 11.6 3.6 Dry Light drizzle Rain 1975 Low 9.9 0.0 9.9 7.77 1887.0 NaN 1887.0 62.0 NaN 62.0 0.0 1 Moderate 37.38283 -5.97317

Money Spent

I tracked spending money "offline" and "online". Offline here means that I paid in person with my card and online means that I paid online with my card.
I tried only paying with card so most of my payments should be covered.

Comparing total money spent across weather conditions and days of the week, we can see that the mean total money spent per day is highest on clear days and on Sundays.
Y-scales are censored on purpose.


Offline Money Spent

For the offline spending the largest payment (excluding ATM withdrawals) was a flamenco show group evening at a bar that I prepaid for my friends.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-05-26 Friday Weekday May Spring Seville Spain 28.1 27.4 0.7 Moderate Moderate rain Rain 10283 Moderate ????? 13.9 ????? 12.76 NaN 10209.0 10209.0 NaN 73.0 73.0 1.0 23 Moderate 37.38283 -5.97317

Online Money Spent

The largest online payment was prepaying for our vacation rental in Cádiz.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-05-07 Sunday Weekend May Spring Seville Spain 32.1 31.0 1.1 Moderate Clear Clear 24232 High 28.4 ????? ????? 11.63 NaN 21945.0 21945.0 NaN 85.0 85.0 1.0 150 High 37.38283 -5.97317

At some point I also had to buy a new phone because I destroyed my old one while jumping around to music with my housemates. This was the second most expensive day.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2022-10-15 Saturday Weekend October Autumn Seville Spain 28.9 28.2 0.7 Moderate Clear Clear 6076 Moderate 8.28 ????? ????? 11.42 NaN NaN NaN NaN NaN NaN NaN 0 NaN 37.38283 -5.97317

And the day of the Barcelona vacation rental and flight payments comes after that.


weekday weekend month season city country max_temp apparent_max_temp temp_difference humidity weather_condition simplified_weather steps activity money_spent_offline money_spent_online money_spent_total wifi_count_mean watch_1_steps watch_2_steps watch_steps_combined watch_1_mean_heart_rate watch_2_mean_heart_rate heart_rate_combined watch_type photos_videos media_activity generic_lat generic_lon
date
2023-03-05 Sunday Weekend March Spring Seville Spain 16.2 14.6 1.6 Moderate Light rain Rain 1972 Low 17.28 ????? ????? 10.22 2076.0 NaN 2076.0 65.0 NaN 65.0 0.0 2 Moderate 37.38283 -5.97317

Correlations

Let's look at correlations between the different variables. The following correlations are the most interesting ones for me:


steps, heart_rate_combined: 0.62***
steps, photos_videos: 0.46***
photos_videos, heart_rate_combined: 0.37***
temp_difference, apparent_max_temp: -0.70***

Interpretation: Usually, on days I took more steps I also had a higher heart rate. This is expected because I was more active on these days.

On days I took more steps I also took more photos and videos. I like taking photos, so this is not surprising either.

On days I took more photos I also usually had a higher heart rate, but this correlation is rather weak.

Finally, on days with a higher difference between the max and apparent max temperature there was usually a lower apparent max temperature. I think a reason for this can be that I was mostly in dry-ish climates.
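For readers who want to see what goes into such a coefficient, here is a minimal sketch of the Pearson correlation in plain Python. That the values above are Pearson coefficients is an assumption, and the sample values are invented.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / sqrt(var_x * var_y)


# Hypothetical daily values, NOT taken from the dataset.
steps = [2000, 6000, 9000, 15000, 24000]
heart_rate = [64, 70, 69, 74, 80]

r = pearson_r(steps, heart_rate)
print(f"r = {r:.2f}")
```

A value near 1 means the two variables rise together almost linearly, a value near 0 means there is no linear relationship, and a negative value means one falls as the other rises.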

Unlike in the previous data post, there are no significant correlations with money_spent anymore.
That may be a less interesting result now, but it's good because it increases the potential for more accurately predicting my location later.

Here are two example plots showing the weak correlation with money spent (money scales are censored on purpose):

One could also expect a correlation between the max temperature and the number of steps. However, this correlation is very weak (0.25):

Linear Regression

My goal with linear regression here is not to make accurate predictions, but to find interesting relationships between the variables.
Of course, these relationships are not causal.
Also, I didn't test the assumptions of linear regression here, so the results are not very reliable.

Steps as dependent variable

First I run a linear regression with the number of steps as the dependent variable. Through backwards elimination of independent variables that are insignificant at the 1% level I end up with the following model.
apparent_max_temp, wifi_count_mean, and photos_videos are the only significant independent variables.
An interpretation of the result would be that if the apparent max temperature increased by 1°C, I took on average 209.36 more steps, holding the other variables constant.
If the mean Wi-Fi count increased by 1, I took on average 362.81 more steps, holding the other variables constant.
If the number of photos and videos I took increased by 1, I took on average 89.26 more steps, holding the other variables constant.


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  steps   R-squared:                       0.299
Model:                            OLS   Adj. R-squared:                  0.294
Method:                 Least Squares   F-statistic:                     58.49
Date:                Sun, 24 Dec 2023   Prob (F-statistic):           1.61e-31
Time:                        13:37:10   Log-Likelihood:                -4243.7
No. Observations:                 416   AIC:                             8495.
Df Residuals:                     412   BIC:                             8512.
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const               877.9449   1305.529      0.672      0.502   -1688.385    3444.274
apparent_max_temp   209.3622     35.732      5.859      0.000     139.123     279.601
wifi_count_mean     362.8066     76.920      4.717      0.000     211.603     514.011
photos_videos        89.2580      8.573     10.412      0.000      72.406     106.110
==============================================================================
Omnibus:                       40.938   Durbin-Watson:                   1.468
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               54.677
Skew:                           0.725   Prob(JB):                     1.34e-12
Kurtosis:                       4.027   Cond. No.                         177.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
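The backward elimination mentioned above follows a simple loop: fit the model, drop the variable with the largest p-value if it is not significant, and refit. The sketch below keeps the fitting step abstract. With statsmodels, `fit_p_values` could wrap something like `sm.OLS(y, sm.add_constant(X[columns])).fit().pvalues`, but that wiring, the function names, and the stub p-values are assumptions for illustration.

```python
def backward_elimination(candidates, fit_p_values, alpha=0.01):
    """Drop the least significant variable until all p-values are below alpha.

    fit_p_values(columns) must refit the model on `columns` and return a
    dict mapping each column name to its p-value.
    """
    selected = list(candidates)
    while selected:
        p_values = fit_p_values(selected)
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] < alpha:
            break  # everything left is significant at the chosen level
        selected.remove(worst)
    return selected


# Stub standing in for a real refit; these p-values are made up.
def fake_fit(columns):
    made_up = {
        "apparent_max_temp": 0.0001,
        "wifi_count_mean": 0.0002,
        "photos_videos": 0.0001,
        "money_spent_total": 0.40,
        "max_temp": 0.07,
    }
    return {name: made_up[name] for name in columns}


kept = backward_elimination(
    ["apparent_max_temp", "wifi_count_mean", "photos_videos",
     "money_spent_total", "max_temp"],
    fake_fit,
)
print(kept)
```

In a real run the p-values change after every refit, which is why the model has to be fitted again in each round instead of dropping all insignificant variables at once.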

Photos/Videos ~ steps

Next, I run a linear regression with the number of photos and videos as the dependent variable and just the steps as the independent variable.
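As note [2] in the output says, the condition number is large. With only one independent variable this most likely reflects the large scale of the steps values compared to the constant rather than multicollinearity. Like I said before, I didn't test the assumptions of linear regression here, so the results are not very reliable.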

Nevertheless, the model estimates that if my number of steps increased by 1, the number of photos and videos I took increased by 0.0022 on average.
The negative intercept is significant, but has no practical meaning in this case.


                            OLS Regression Results                            
==============================================================================
Dep. Variable:          photos_videos   R-squared:                       0.208
Model:                            OLS   Adj. R-squared:                  0.206
Method:                 Least Squares   F-statistic:                     108.4
Date:                Sun, 24 Dec 2023   Prob (F-statistic):           1.05e-22
Time:                        13:37:11   Log-Likelihood:                -2050.7
No. Observations:                 416   AIC:                             4105.
Df Residuals:                     414   BIC:                             4114.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         -8.1747      2.949     -2.772      0.006     -13.971      -2.379
steps          0.0022      0.000     10.412      0.000       0.002       0.003
==============================================================================
Omnibus:                      428.657   Durbin-Watson:                   1.399
Prob(Omnibus):                  0.000   Jarque-Bera (JB):            15285.273
Skew:                           4.624   Prob(JB):                         0.00
Kurtosis:                      31.219   Cond. No.                     2.50e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.5e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
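
A rough illustration of why note [2] can show up even with a single regressor: the condition number of the design matrix (a constant column plus the steps) grows with the scale of the steps. The sketch below computes it as the square root of the ratio of the largest to the smallest eigenvalue of X'X, which is, as far as I know, how statsmodels defines the value it reports. The step counts are made up, and `design_condition_number` is a hypothetical helper, not a library function.

```python
import math


def design_condition_number(x_values):
    """Condition number of the design matrix [1, x] (constant plus one regressor).

    Computed as sqrt(largest / smallest eigenvalue) of the 2x2 matrix X'X.
    """
    n = len(x_values)
    sum_x = sum(x_values)
    sum_x2 = sum(x * x for x in x_values)
    trace = n + sum_x2
    det = n * sum_x2 - sum_x * sum_x
    largest = (trace + math.sqrt(trace * trace - 4 * det)) / 2
    smallest = det / largest  # avoids cancellation in (trace - sqrt(...)) / 2
    return math.sqrt(largest / smallest)


# Made-up daily step counts, only for illustration.
steps = [2575, 3579, 8120, 12040, 15300, 6400, 9800, 21050]
steps_in_thousands = [s / 1000 for s in steps]

cond_raw = design_condition_number(steps)
cond_scaled = design_condition_number(steps_in_thousands)
print(f"steps as counts:    {cond_raw:.3g}")
print(f"steps in thousands: {cond_scaled:.3g}")
```

Rescaling only changes the unit of the coefficient (per 1,000 steps instead of per step); the fit itself stays the same.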

Heart Rate ~ steps

Let's look at the relationship between the heart rate and just the number of steps.
The model estimates that if my number of steps increased by 1, my mean heart rate for that day increased by 0.0004 bpm on average, or about 4 bpm per 10,000 steps.


                             OLS Regression Results                            
===============================================================================
Dep. Variable:     heart_rate_combined   R-squared:                       0.387
Model:                             OLS   Adj. R-squared:                  0.385
Method:                  Least Squares   F-statistic:                     224.3
Date:                 Sun, 24 Dec 2023   Prob (F-statistic):           1.14e-39
Time:                         13:38:03   Log-Likelihood:                -960.01
No. Observations:                  358   AIC:                             1924.
Df Residuals:                      356   BIC:                             1932.
Df Model:                            1                                         
Covariance Type:             nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         67.3933      0.334    201.593      0.000      66.736      68.051
steps          0.0004   2.41e-05     14.977      0.000       0.000       0.000
==============================================================================
Omnibus:                       49.215   Durbin-Watson:                   1.443
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               81.853
Skew:                           0.822   Prob(JB):                     1.68e-18
Kurtosis:                       4.669   Cond. No.                     2.48e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.48e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Heart Rate ~ watch_type + steps

As I switched to a different watch in May 2023, I also wanted to see if the devices differ in their heart rate measurements.
I wasn't sure whether to include this section: the set of independent variables is very limited, I only tried linear relationships, and the resulting R-squared values are rather low. These limitations may make the results unreliable, but I'll include the section anyway to show my thinking.

Another way to ask whether the devices differ in their measurements is to ask whether the watch type influences the measured mean heart rate.
To check this, I run a linear regression with the mean heart rate as the dependent variable and the watch type and the number of steps as the independent variables. I include the steps because I want to account for their influence on the heart rate.
The estimated watch_type coefficient is not significant (p = 0.445), so the model finds no evidence that the watch type influences the measured mean heart rate. That would be good, because if both devices are accurate there shouldn't be a difference.


                             OLS Regression Results                            
===============================================================================
Dep. Variable:     heart_rate_combined   R-squared:                       0.388
Model:                             OLS   Adj. R-squared:                  0.384
Method:                  Least Squares   F-statistic:                     112.3
Date:                 Sun, 24 Dec 2023   Prob (F-statistic):           1.61e-38
Time:                         13:38:25   Log-Likelihood:                -959.71
No. Observations:                  358   AIC:                             1925.
Df Residuals:                      355   BIC:                             1937.
Df Model:                            2                                         
Covariance Type:             nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const         67.2716      0.370    181.645      0.000      66.543      68.000
watch_type     0.2888      0.377      0.765      0.445      -0.453       1.031
steps          0.0004   2.41e-05     14.937      0.000       0.000       0.000
==============================================================================
Omnibus:                       47.203   Durbin-Watson:                   1.441
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               77.079
Skew:                           0.800   Prob(JB):                     1.83e-17
Kurtosis:                       4.614   Cond. No.                     3.31e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.31e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
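
The significance statement for watch_type can be checked by hand from the coefficient and standard error in the table above. The sketch uses a normal approximation instead of the t distribution, which is an assumption on my part; with 355 residual degrees of freedom the two are very close, so small differences from the table come from that approximation and from rounding.

```python
from statistics import NormalDist

# Values copied from the watch_type row of the regression table.
coef = 0.2888
std_err = 0.377

t_stat = coef / std_err

# Normal approximation to the t distribution; reasonable here because the
# model has 355 residual degrees of freedom.
standard_normal = NormalDist()
p_value = 2 * (1 - standard_normal.cdf(abs(t_stat)))
z_crit = standard_normal.inv_cdf(0.975)
ci_low = coef - z_crit * std_err
ci_high = coef + z_crit * std_err

print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
print(f"95% CI = ({ci_low:.3f}, {ci_high:.3f})")
```

The interval contains zero, which is the same information as the p-value being above 0.05.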

Heart Rate ~ watch_type + photos_videos + apparent_max_temp

I also tried to use the number of photos/videos I took and the apparent max temperature as proxies for how active I was on a day.
Here the watch type coefficient isn't significant either (p = 0.911).


                             OLS Regression Results                            
===============================================================================
Dep. Variable:     heart_rate_combined   R-squared:                       0.161
Model:                             OLS   Adj. R-squared:                  0.154
Method:                  Least Squares   F-statistic:                     22.59
Date:                 Sun, 24 Dec 2023   Prob (F-statistic):           2.10e-13
Time:                         13:38:28   Log-Likelihood:                -1016.1
No. Observations:                  358   AIC:                             2040.
Df Residuals:                      354   BIC:                             2056.
Df Model:                            3                                         
Covariance Type:             nonrobust                                         
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
const                69.1102      0.579    119.364      0.000      67.972      70.249
watch_type           -0.0576      0.515     -0.112      0.911      -1.071       0.956
photos_videos         0.0408      0.006      7.338      0.000       0.030       0.052
apparent_max_temp     0.0784      0.028      2.803      0.005       0.023       0.133
==============================================================================
Omnibus:                       33.655   Durbin-Watson:                   1.559
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               43.765
Skew:                           0.699   Prob(JB):                     3.14e-10
Kurtosis:                       3.990   Cond. No.                         122.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Goal 2: Prediction

Now let's get to the second part: predicting my location on a city level based on my behavior.
I filtered out all days spent in cities where I stayed for fewer than 7 days. This leaves 5 possible cities to predict: Seville, Munich, Berlin, Cádiz, and Istanbul.
Then I applied encoding and standardization where necessary/applicable.
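
The preprocessing described above can be sketched roughly as follows. The records are made up, "OtherCity" is a placeholder for a short stay, and the variable names are hypothetical; only the 7-day threshold comes from the text.

```python
from collections import Counter
from statistics import mean, pstdev

MIN_DAYS = 7  # threshold from the text

# Hypothetical daily records as (city, steps); all values are made up.
days = [("Seville", 9000 + 500 * i) for i in range(10)]
days += [("Munich", 3000 + 400 * i) for i in range(8)]
days += [("OtherCity", 15000), ("OtherCity", 17000)]  # only 2 days

# Keep only days from cities with at least MIN_DAYS recorded days.
days_per_city = Counter(city for city, _ in days)
kept = [(city, steps) for city, steps in days if days_per_city[city] >= MIN_DAYS]

# Label encoding: map each remaining city to an integer class.
cities = sorted({city for city, _ in kept})
city_to_label = {city: index for index, city in enumerate(cities)}
labels = [city_to_label[city] for city, _ in kept]

# Standardization (z-score): zero mean and unit variance for the steps.
step_values = [steps for _, steps in kept]
mu = mean(step_values)
sigma = pstdev(step_values)
steps_standardized = [(s - mu) / sigma for s in step_values]

print(city_to_label)
```

When the data is split into training and validation sets, the mean and standard deviation would normally be computed on the training part only, so that no information from the validation days leaks into the features.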

My goal here isn't to produce the absolute best model, but rather to try out different approaches and see what might work better than other approaches here.

Neural Network

The first approach I tried was to use a neural network. For this I also used SMOTE to balance the classes.
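
For readers who haven't seen SMOTE: it creates synthetic samples for under-represented classes by interpolating between a sample and one of its nearest neighbours from the same class. Below is a simplified pure-Python sketch of that core idea with made-up numbers. It is not necessarily the exact procedure behind the results in this post, and `smote_like_oversample` is a hypothetical helper (libraries such as imbalanced-learn provide a full implementation).

```python
import math
import random


def smote_like_oversample(samples, n_new, k=3, seed=0):
    """Create n_new synthetic points for one minority class.

    Each new point lies on the line segment between a randomly chosen
    sample and one of its k nearest neighbours from the same class.
    Assumes `samples` contains at least two points.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        neighbours = sorted(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Made-up standardized (steps, photos_videos) pairs for a rarely visited city.
minority = [(-1.0, -0.5), (-0.8, -0.4), (-1.2, -0.6), (-0.9, -0.1)]
new_points = smote_like_oversample(minority, n_new=6)
print(len(minority) + len(new_points), "samples after oversampling")
```

Oversampling like this is usually applied to the training split only; otherwise synthetic copies of validation days can end up in the training data and inflate the validation accuracy.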

NN: First Try

For the first neural network only, I used all variables except those coming from the watch and those directly hinting at my location (coordinates and country).
Without much experimentation I ended up with the following architecture:
three dense layers, the first two with a ReLU activation and the last one with a softmax activation.


Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense (Dense)               (None, 64)                3776      
                                                                 
 dense_1 (Dense)             (None, 32)                2080      
                                                                 
 dense_2 (Dense)             (None, 5)                 165       
                                                                 
=================================================================
Total params: 6,021
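
The parameter counts in the summary above follow directly from the layer sizes: a dense layer has one weight per input-unit pair plus one bias per unit. The summary doesn't list the input size, but it can be backed out from the first layer's count. A small sketch of that arithmetic (`dense_params` is just an illustrative helper):

```python
def dense_params(n_inputs: int, n_units: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


# Output sizes from the model summary.
layer_units = [64, 32, 5]

# The first layer has 3776 = 64 * (n_inputs + 1) parameters,
# which implies 58 input features after encoding.
n_inputs = 3776 // 64 - 1

counts = []
previous = n_inputs
for units in layer_units:
    counts.append(dense_params(previous, units))
    previous = units

print(n_inputs, counts, sum(counts))
```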

I used the Adam optimizer, a categorical cross-entropy loss, and accuracy as the metric. After training for 10 epochs I got the following results:

[Figure: accuracy over the training epochs. Orange: train accuracy; blue: validation accuracy.]

[Figure: loss over the training epochs. Orange: train loss; blue: validation loss.]

NN Second Try: Only Behavioral Variables

For the second try with a neural network I only left the behavioral variables (steps, money spent, photos/videos).
This was my goal from the beginning.
I also tried out a different architecture with one ReLU layer, one softmax layer, and L2 regularization.


Model: "sequential_3"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 dense_7 (Dense)             (None, 64)                704       
                                                                 
 dense_8 (Dense)             (None, 5)                 325       
                                                                 
=================================================================
Total params: 1,029
Trainable params: 1,029
Non-trainable params: 0
_________________________________________________________________

[Figure: accuracy over the training epochs. Orange: train accuracy; blue: validation accuracy.]

[Figure: loss over the training epochs. Orange: train loss; blue: validation loss.]

Here we end up with a pretty bad validation accuracy. It also looks a lot like the model is overfitting.
I didn't put much effort into improving this, partly for lack of time and partly because I think the most promising fix would be more data, which I can't get without waiting several years.

New example not included in dataset

Nevertheless, the model can already correctly predict the city of my test day:

December 19, 2023


example = {
    'steps': 2575,
    'money_spent_offline': 0,
    'money_spent_online': 0,
    'money_spent_total': 0,
    'photos_videos': 0
}

Actual: Munich
Predicted: Munich

However, with a validation accuracy this low, this might as well be just a random guess.

KNN

The next method I tried was KNN, where the highest accuracy I got was 0.42 with K=3.
This is already better than the NN, but I suspected the next methods would lead to better results.
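
As a reminder of what KNN does: it assigns a day to the city that is most common among the K most similar training days. A minimal pure-Python sketch with K=3, using made-up standardized values and only two cities; `knn_predict` is a hypothetical helper, not the code behind the accuracy above.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Majority vote among the k training points closest to `query`."""
    by_distance = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], query),
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up standardized (steps, money_spent_total, photos_videos) rows.
train_points = [
    (-1.1, -0.6, -0.5), (-0.9, -0.4, -0.5), (-1.0, -0.7, -0.4),  # Munich days
    (0.8, 0.5, 0.9), (1.1, 0.3, 1.2), (0.9, 0.8, 0.7),           # Seville days
]
train_labels = ["Munich"] * 3 + ["Seville"] * 3

prediction = knn_predict(train_points, train_labels, (-1.0, -0.5, -0.5))
print(prediction)
```

Because KNN relies on distances, the standardization step matters a lot here: without it, the step counts would dominate the other features simply because of their larger scale.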

Random Forest

For random forest I did a random search for the best parameters.
Here I ended up with an accuracy of 0.65.

XGBoost

And for XGBoost I got an accuracy of 0.69 (nice).
With a random search I even found parameters that led to an accuracy of 0.7. This appears to be the best model I could find without investing more time into this.
And I think it's already pretty good. It can also correctly predict the city of my test day:

New example not included in dataset

December 19, 2023


example = {
    'steps': 2575,
    'money_spent_offline': 0,
    'money_spent_online': 0,
    'money_spent_total': 0,
    'photos_videos': 0
}

Actual: Munich
Predicted: Munich

But of course, it's not always correct:
December 05, 2023


example = {
    'steps': 3579,
    'money_spent_offline': 14.1,
    'money_spent_online': 0,
    'money_spent_total': 14.1,
    'photos_videos': 0
}

Actual: Munich
Predicted: Berlin

For this day it got the city wrong, but at least it was still in the same country.

Counting how often a feature is used to split the data across all trees, we can see that the number of steps is the most important feature, followed by money spent offline, the number of photos/videos taken, and money spent online. 'activity' and 'media_activity' are just augmentations I derived from the number of steps and photos/videos.
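
The split-count idea can be sketched in a few lines. The two trees below are hand-written and made up, and the nested-dict format is only an assumption for illustration, not XGBoost's internal representation. In XGBoost itself, this kind of count corresponds, as far as I know, to the "weight" importance type.

```python
from collections import Counter


def count_splits(node, counts):
    """Recursively count which feature each internal node splits on."""
    if "feature" not in node:  # leaf node, nothing to count
        return
    counts[node["feature"]] += 1
    count_splits(node["left"], counts)
    count_splits(node["right"], counts)


# Two tiny hand-written trees; structure and values are made up.
trees = [
    {"feature": "steps",
     "left": {"feature": "money_spent_offline",
              "left": {"leaf": 0.1}, "right": {"leaf": -0.2}},
     "right": {"feature": "steps",
               "left": {"leaf": 0.3}, "right": {"leaf": -0.1}}},
    {"feature": "photos_videos",
     "left": {"leaf": 0.2},
     "right": {"feature": "steps",
               "left": {"leaf": -0.3}, "right": {"leaf": 0.4}}},
]

split_counts = Counter()
for tree in trees:
    count_splits(tree, split_counts)

ranking = split_counts.most_common()
print(ranking)
```

One caveat of this measure: features with many distinct values, like the number of steps, offer more possible split points and therefore tend to be counted more often than features with few distinct values.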

Conclusion

The XGBoost model turned out to work best for this task. As I didn't try improving much on the neural network, further exploration there might lead to better results. Also, the accuracy metric I used could be complemented by other metrics.
I could have incorporated more behavioral features like which transportation modes I used on the day, but I didn't want to spend too much time on this project.
For now, I'm very content with the result already.
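
As an example of the other metrics mentioned above, per-city precision and recall show which cities a model confuses, which a single accuracy number hides. A minimal sketch with made-up labels and predictions (not results from the models in this post):

```python
from collections import Counter

# Made-up true and predicted cities for a handful of days.
y_true = ["Munich", "Munich", "Seville", "Seville",
          "Seville", "Berlin", "Berlin", "Munich"]
y_pred = ["Munich", "Berlin", "Seville", "Seville",
          "Munich", "Berlin", "Munich", "Munich"]

# Confusion counts keyed by (actual city, predicted city).
pair_counts = Counter(zip(y_true, y_pred))
classes = sorted(set(y_true) | set(y_pred))

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

recall = {}
precision = {}
for c in classes:
    true_positive = pair_counts[(c, c)]
    actual = sum(pair_counts[(c, other)] for other in classes)
    predicted = sum(pair_counts[(other, c)] for other in classes)
    recall[c] = true_positive / actual if actual else 0.0
    precision[c] = true_positive / predicted if predicted else 0.0

# Macro average: every city counts equally, regardless of its number of days.
macro_recall = sum(recall.values()) / len(classes)

print(f"accuracy: {accuracy:.3f}, macro recall: {macro_recall:.3f}")
for c in classes:
    print(f"{c}: precision {precision[c]:.2f}, recall {recall[c]:.2f}")
```

With imbalanced classes, such as many days in one city and few in the others, macro-averaged metrics are a useful complement because a model can reach a decent accuracy by favoring the most frequent city.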

It shows that my behavior differs enough between cities for a model to predict the city I'm in correctly about 70% of the time. The model is not applicable to anyone besides myself, but I suspect that lots of people would get similar results if they were to try this.

Although doing this was fun, I'm not sure if there are any practical applications for the prediction of my location, as every device has GPS built in and knows its location.
As GPS usage is rather battery-consuming, there might be use cases for predicting the location without GPS usage. As this is done on a city-level, it might be useful for, e.g., weather services.
At least this post might inspire similar fun personal projects for others which can help to learn about data collection, exploration, and prediction.
I don't plan to keep improving this model, but I can imagine continuing to collect my data for further explorations and visualizations over the years.

I hope you enjoyed this little project.
What is left for me to write is a general summary post about the whole time.