
Sunday, July 24, 2016

Interpretations of Howl's Moving Castle (2005 - Hayao Miyazaki)

Howl's Moving Castle was directed by Hayao Miyazaki, a Japanese artist whose work is known to be symbolic, controversial, and magnificent. If I may add, when his content is interpreted the way he intended, his movies are very realistic and tell us not just the story on screen but a reflection of life we can really relate to - from love and hardship to something broader like politics, industrialization, war, and technological advances. Miyazaki's work is the kind of animation that needs to be digested slowly, because he uses a lot of allegory and metaphor to express his opinions. Not just in the visuals themselves - I found the quotes his characters use, and the actions they take, extremely interesting.

If I may add, I am a seasoned anime fan myself, but only Miyazaki's movies have been able to lead me to write a blog about them. I think it's because his movies engage the audience in a different way. This weekend I watched two of them, neither of which has a clear "good guy" versus "bad guy" theme. The movies themselves sometimes end abruptly, as if once a specific message has been delivered, Miyazaki ends the story.

But anyways, some of his movies include:

Spirited Away (2001)
Princess Mononoke (1997)
My Neighbor Totoro (1988)

Unfortunately, it has been announced that he's going to retire from the animation industry. Right now, his animation efforts are more of a hobby than a real commitment. Though Studio Ghibli - the studio that publishes his work - is still around, I don't know if it will produce something as savory as Miyazaki's work. But anyways, below are a couple of my thoughts about Howl's Moving Castle. Note that there are a lot of spoilers here. I recommend watching the movie first and then coming back to check whether your interpretation was as great as mine. The main joy of watching a Miyazaki film is perhaps scrutinizing his work.


Settings

The story seems to be set somewhere like the British Empire, already technologically advanced but not completely modern. The movie starts off with the Brits celebrating some kind of war - there are a lot of flags here, so it might be WWI or WWII.

Brits celebrating some kind of war. Could be WWI or WWII.

The protagonist, Sophie, is a young woman who works at a hat shop. She already stands out because she dresses in very plain, if not humble, clothing despite working in the fashion trade, whereas her friends and mother seem more extroverted and fashionable. Sophie also has low self-esteem and claims she is not beautiful even though she is.


Sophie

The beginning also shows she's not interested in relationships - especially when her friends mention that Howl is a mysterious being who takes beautiful women's hearts. She doesn't think like a typical young girl, and her lifestyle doesn't match her current job. It looks as though she's a character being dragged along by her environment. Later on, she becomes the main driving force of the story after the Witch of the Waste places a curse on her that turns her into an old hag.

Yup. That's plain old Sophie walking away from the crowds.


Youth

She sets out on an adventure to have her curse lifted by a wizard named Howl, whom the Witch of the Waste so obsessively loves. It was very amusing to see her adapt quickly to her condition, keeping a sarcastic attitude while interacting with the side characters she encounters on her journey to the castle.
She was glad not to be picked on by men interested in her youth, and she was even offered assistance with her travels after being turned into an old hag.
I felt this was a great message about youth today, and not just for women. Too many times society expects certain things from us, and it's difficult to go against the social norm.
Sophie comments that her outfit suits her better than before.

The Moving Castle

Sophie was able to find Howl's Moving Castle with the assistance of a mysterious scarecrow she pulled out of a bush. Note that the title itself carries allegorical meaning. The castle looks like a bunch of weird iron junk patched together, and it moves on four chicken legs. Apparently this is the dwelling of the infamous wizard, Howl. The literal meaning is simply the castle where Howl lives. The symbolic meaning is the stability of Howl's inner world: since the castle is patched together and doesn't look very stable, it shows that Howl is actually a fragile person. If there is a third meaning, I think the castle also represents Howl's heart. Sophie being able to enter his castle is significant. Later in the story she even cleans the castle, and eventually destroys the whole thing when she takes the fire demon Calcifer out of it. This exposes Howl's weakness as someone who gave his heart to the fire demon in exchange for great power.

Howl's Moving Castle

What Does Calcifer Represent?

Howl's Younger Self
Since we're talking about Calcifer, let's analyze his role a little. Calcifer runs Howl's entire moving castle, and his role seems to be to stay in the castle the whole time. But maybe he was meant for something else. There's a scene later in the story where Sophie visits Howl's younger self using a ring. In that scene, multiple shooting stars fly toward Howl; some never reach him and eventually fade away. Howl catches one particular shooting star and immediately swallows it. That one was Calcifer. I felt those shooting stars were the different ambitions or goals that shaped Howl early on into who he is in the present story. Similarly, when we were young we fantasized about the goals and ideals we wished to pursue. Maybe they don't suit us realistically, which creates a weak inner world as we grow up. Calcifer represents that ideal, but one that somehow ended up manifested in the minuscule job of moving the castle - an engine that grants Howl his fragile, mobile freedom.





Sophie Transforms Young and Old 

Throughout the film, Howl never directly mentions lifting or fixing Sophie's curse. I mean, this was the main driving force that made Sophie leave her hat shop for an unknown yet exciting adventure (which she seemed to have no problem doing at all). I was reading another person's interpretation about this; a fellow programmer said it could be Sophie's "special power". I'm going to take a different route and claim that the curse the Witch of the Waste cast on Sophie was not a spell that turns her into a 90-year-old. It was a spell that projects a character's self-image onto their actual appearance. I think of it as a measure of energy vs. negativity, love vs. solitude, high self-esteem vs. low confidence. It somewhat explains why Sophie sometimes turns young when she talks about Howl. And when she sleeps, she turns into a younger person too, but reverts as soon as she's awake (because when she is sleeping, she's not projecting anything at all). Similarly, in our world, maybe it's not a real curse. Some of us may have low self-esteem but are in fact extremely capable. The image we project onto ourselves affects our daily lives.















Moral?  

One of my favorite quotes of the story
I felt the ending came quickly: as soon as Sophie returns Howl's heart and brings him back to life, the international war is ended by all parties at the same time with no obvious reason. Everyone was like, "Okay, let's end this foolish war now..." As cheesy as this may sound, I think Miyazaki might be trying to say, "We often create worldly problems ourselves. With a bit of love and without taking shortcuts, we can come to a happy ending." Though I'm not too sure whether the story was supposed to evoke WWI or WWII - the zeppelin raids on the United Kingdom happened during WWI. Note that the film is based on a novel by Diana Wynne Jones whose setting is quite different. Well, I hope my commenters can enlighten me on this part. Please do bring any other thoughts or insights.
Madame Suliman just REALIZED this is a foolish war.






Technically Sophie kissed everyone and that fixed everything...
Thank you, Howl.



Monday, June 20, 2016

Thoughts about MITx: 15.071x The Analytics Edge on EDX

I have FINALLY completed this course on edX. Aside from I Heart Statistics, this is by far one of the best statistics courses I have ever taken in my life - and I encourage everyone who is getting into Data Science to check it out, as the materials are still very relevant. Some of the things it went over were:

  • Linear Regression
  • Logistic Regression
  • Decision Trees
  • Random Forest 
  • Text Analytics (Bag of Words)
  • Clustering
  • Visualization
  • Linear Optimization
  • Integer Optimization
  • Kaggle
Of course, all of this was taught in R, which forced me to learn R. I don't think it is possible to complete the homework units without R, because of R's set.seed() command, which fixes the pseudo-random sequence used by certain random selection/generation functions in R packages - so the graded splits can only be reproduced in R. So for those who want to complete this course in Python, you have been warned.
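For what it's worth, the same reproducibility idea exists in Python - fixing a seed makes a "random" split repeatable - but numpy's generator will not reproduce the sequences R's set.seed() produces, which is exactly why the graded splits are hard to match outside R. A tiny sketch (the numbers here are made up):

import numpy as np

np.random.seed(88)  # analogous in spirit to R's set.seed(88), but a different generator and sequence
train_idx = np.random.choice(np.arange(10), size=7, replace=False)  # a repeatable 70% "split" of ten rows
print(train_idx)  # the same rows come out on every run with the same seed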

Now to the meat. I was extremely impressed with how the topics were presented in this course. The videos, on average five minutes long each, were succinct enough to teach individual topics with real-world examples. One of the worst ways of learning is learning without relevant applicability, and this course does a great job in this realm.

For example, I didn't know American Airlines used linear optimization to change the way it sells seats. This led to the science of revenue management - framing how airline seats should be sold, where regular seats are sold to cover operating costs and discount seats are sold to generate additional income (that is the 10,000-foot explanation). In 1985, American Airlines launched a program called "Ultimate Super Saver". This move competed with People Express, another major airline, and forced them into bankruptcy through competitive pricing.
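To make the idea concrete, here is a toy linear-optimization sketch in Python with scipy (not the course's R code): decide how many full-fare vs. discount seats to sell on a single flight. The prices, demand forecasts, and capacity are made-up numbers, and linprog minimizes, so revenue is negated.

from scipy.optimize import linprog

prices = [400, 150]       # hypothetical full fare and discount fare
capacity = 180            # hypothetical number of seats on the plane
demand = [60, 250]        # hypothetical demand forecast per fare class

c = [-p for p in prices]                    # maximize revenue == minimize negative revenue
A_ub = [[1, 1]]                             # full-fare seats + discount seats <= capacity
b_ub = [capacity]
bounds = [(0, demand[0]), (0, demand[1])]   # can't sell more than the forecast demand

result = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
print("Seats to sell (full, discount): %s" % result.x)   # e.g. 60 full-fare, 120 discount
print("Expected revenue: %.0f" % -result.fun)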

"Donald Burr, the CEO of People Express, is quoted as saying "We were a vibrant, profitable company from 1981 to 1985, and then we tipped right over into losing $50 million a month... We had been profitable from the day we started until American came at us with Ultimate Super Savers." 
https://en.wikipedia.org/wiki/Yield_management 

What did I gain from taking this course?
  • Understanding R vs Python 
  • Learning how to frame problems 
  • Conducting tests and validation using training sets and testing sets (see the sketch after this list)
  • Using different techniques based on the use case, and the general strengths/weaknesses of each technique
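As a rough illustration of the train/validate workflow mentioned above - this is a Python/scikit-learn sketch on synthetic data, not the course's R code:

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.RandomState(42)
X = rng.normal(size=(500, 3))                                    # three made-up predictors
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
model = LogisticRegression().fit(X_train, y_train)               # fit on the training set only
print("Held-out accuracy: %.3f" % accuracy_score(y_test, model.predict(X_test)))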
Over the course of the next few weeks, I'll be re-using some of the concepts I learned from this course and going into the details of building models. Stay tuned!



Tuesday, April 12, 2016

Ode to Solitude


You're a compulsory condition
Though black is your initial appearance,
rainbow is in your depth
Loneliness,Eagerness,Distaste,Jealousy
Freedom,Gratitude,Love,Content
What a Dilemma!
If you're my enemy in short term decisions,
You're my friend in longer ones

All in all, if I can keep it together
What comes about, I hope is
Realization
It just so happens that some of us spends more time alone
to Realize something out of all those emotions
Through you, I gain new destinations
Through you, I gain invincibility, immunity, and finally maturity
Through you, I become more of me than what I want others to see
Last but not least, through you, I lose social cadence
Oh Solitude! We have such a love and hate relationship!
Though if I do get into a real intimate relationship,
I hope I won't forget all the things you taught me
In the end I think the gains outweighs the loss


-s

Sunday, April 10, 2016

Weather Data Analysis - Part II

Oakland Weather Pattern with T-Stats.


This is Part II of my previous post regarding Oakland's weird weather despite Global Warming. You can go ahead and read what I wrote there (which is a jumble of IPython Notebook code that I can no longer change because Blogger is acting weird). If you've seen it already, skip the following paragraph and move to the next one.

To sum it up, this analysis began because I've been living in the Oakland area for two years and five months and haven't felt the weather getting warmer year after year, despite all the severity I've been hearing about Global Warming and how the average global temperature has climbed about 1.4 degrees Fahrenheit over the past century. So these posts were really about checking up on that information and looking at some future implications. I got my hands on some weather data and began comparing the years 2014 and 2015. From the t-test statistics, it looks like there wasn't a significant difference at an alpha of 0.05. The t-test just says that I could be complaining about the cold too much, and the two years were actually very similar in terms of means. I also concluded that Global Warming is not making Oakland's weather warmer, but more volatile. At the time I only had Year 2014 and Year 2015 data, and based on some graphs I created, that was the conclusion I made.

For those who aren't too familiar with t-statistics: a t-test is a quick statistical test that checks how similar one sample is to another. There are a couple of kinds of t-test. The one I'm using in this and the previous post is the two-sample t-test in its Welch's form, which does not assume the two samples have equal variances. Here's a good video link that describes the t-test in general without getting too deep into the details.


In this post, I prepared average-temperature weather data for the years 2000-2015. First, I need to take back the statement from Part I that Global Warming is making the Bay Area's weather volatile. In fact, I think we should take out the whole Global Warming contributing factor and reassess the current state of Oakland's weather pattern; there's a need to establish baseline expectations before comparing other factors. My box plot below already disproves the idea that global warming is making the weather more volatile over the years, because there isn't any observable pattern of that kind. If volatility increased over the years, the whiskers for recent years should be longer, but as shown, this is not the case. Year 2008 actually seems more volatile than Years 2000 and 2015. As far as the box plot goes, Oakland weather is still mostly in the mid 50s to mid 60s Fahrenheit every year.




My research question is, "Is Oakland's weather changing over the years? If we can't observe a pattern on a year-by-year basis, can we see differences every five years or so?" Why am I asking this question? I've been looking at various articles online, published by well-known sources, saying that part of the character of Global Warming is an accelerating rise in temperature - on the order of 2.5 to 10 degrees Fahrenheit over the next century, due to the amount of human-caused CO2 emissions into the atmosphere. I'm not disagreeing with the information provided by the IPCC or National Geographic. I'd just like to know how Oakland's weather is changing, assuming it is being affected.

Hypothesis

My null hypothesis: in recent years, there is no significant difference between recent Oakland weather temperatures and those of the past.
My alternative hypothesis: there is a significant difference.
Alpha = 0.05

Below are the T-statistic and two-tailed P-value for Year 2000 vs. Year 2015.
Note that I am skipping through the calculation details quickly, since I'm using Python libraries to do the computation. There is a link below to my notebook.
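For reference, here is a minimal sketch of that computation, assuming a daily DataFrame like the one built in Part I (dates in the index, an 'averageTemperature' column); it is not necessarily the exact code in my notebook:

from scipy import stats

temps_2000 = daily_df['2000']['averageTemperature'].dropna()
temps_2015 = daily_df['2015']['averageTemperature'].dropna()

# equal_var=False gives Welch's t-test, which doesn't assume equal variances
t_stat, p_value = stats.ttest_ind(temps_2000, temps_2015, equal_var=False)
print("T-Statistic: %s  P-Value: %s" % (t_stat, p_value))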


Year 2000 and Year 2015: T_Statistics:2.46825543862    P-Value:0.0138134155695

Since 0.05 is my threshold for determining whether there's a difference, receiving a P-value of 0.014 means the Year 2000 and Year 2015 temperatures are significantly different.

Let's look at the Kernel Density Estimation Graph for this.

(KDE is almost like a histogram, except smoothed out: it's created by estimating between one data point and the next, and the density is an estimated probability. To put it into context, in year 2000 there's a density of about 0.038 at 50 Fahrenheit, so for a given day out of 365, there's roughly a 3.8 percent chance the temperature will be around 50 Fahrenheit.
KDE can be effective when there's a lot of data, and the smoothing over outliers is the only drawback I see in using KDE over histograms. And yes, there is an actual formula for it: https://en.wikipedia.org/wiki/Kernel_density_estimation)
Plus a URL evaluating histograms vs. KDE:
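If you want to draw a KDE like the ones below yourself, pandas can do it directly; this is only a sketch assuming the same daily_df / 'averageTemperature' setup from Part I, not necessarily how my figures were produced:

import matplotlib.pyplot as plt

ax = daily_df['2000']['averageTemperature'].plot(kind='density', lw=2, label='Year 2000')
daily_df['2015']['averageTemperature'].plot(kind='density', lw=2, label='Year 2015', ax=ax)
ax.set_xlabel('Average temperature (Fahrenheit)')   # the y-axis (density) is the estimated probability
ax.legend()
plt.show()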



From the graph, it would appear that 2015 was a warmer year, because much more of the density is shifted to the right and the curve is a lot skinnier. But we're working on a complicated question, and using just two years wouldn't justify any conclusion. So I went ahead and plotted the KDE for several more years that were significant, and then I discovered something interesting about the shapes of the various years.



To view the KDEs over the 15 years individually, they are available here

A quick examination of the KDE shapes suggests they transform in a predictable pattern. In Year 2000, the curve looks like an uneven camel hump, with more of the density concentrated at warmer temperatures. In 2008, it looks almost like a perfect normal curve. In the later years, the uneven camel hump comes back, but this time with more concentration on the colder side. Following this pattern, a prediction for year 2016 might be that the camel hump will still be present and the curve slightly more spread out as opposed to 2015. To me, this graph does say the average temperature is increasing, since the curves in later years are skinnier and shifted to the right. But within a year, there are more days that might feel colder than days that feel warmer (more days in the low 50s Fahrenheit than days in the high 60s).

The next thing I'm trying to figure out is, "If I'm in Year 2000, how many years do I need to wait to experience a temperature difference?" To answer this, I basically compared a given base year against every other year using the same t-test. The t-test doesn't necessarily say a person would "feel" the difference; it's just a P-value-focused mathematical comparison.
Summing up the results: it varies. The difference from a base year might be "felt" the following year, or a couple of years later. I'm disappointed to say there's no trend - or rather, my data is too simple because I'm only using average temperatures.
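The comparison itself is just a loop of Welch's t-tests, one base year against every other year; here is a sketch under the same daily_df assumption as above:

from scipy import stats

base_year = '2000'
base = daily_df[base_year]['averageTemperature'].dropna()

for year in [str(y) for y in range(2000, 2016)]:
    if year == base_year:
        continue
    other = daily_df[year]['averageTemperature'].dropna()
    t_stat, p_value = stats.ttest_ind(base, other, equal_var=False)   # Welch's t-test
    print("%s vs %s: p-value %.4f" % (base_year, year, p_value))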

Let's turn our heads the other way and ask, "For a given base year, are the years before and after similar enough?" The answer is "maybe". Below are the highest P-values for a given base year. Sometimes it hits the mark, sometimes it doesn't. Year 2006 and Year 2010 supposedly have a P-value of 97.5%.




Final conclusion

While the effects of Global Warming can be felt in other places and can even change ecosystems depending on the geographic location, there's virtually no effect observed over these 15 years in Oakland in terms of average temperature. While in the KDE graph the temperatures seem to be shifting slightly to the right, there aren't any other observable effects aside from that interesting camel hump forming in recent years.


A couple more graphs below as final thoughts for the curious audience.








I think that's enough T-Statistics for one post. Next, I'll use a different kind!


Some other things to consider in the next post:
  • Use more dimensions of data, e.g. precipitation, humidity, pressure, sunrise and sunset times
  • Calculating drought in California using factors like precipitation and evaporation
  • Calculate water's absorbency of CO2
  • ANOVA, MANOVA, regression?


Please leave some comments! Harsh criticism is welcome.


Data provided by www.forecast.io
My Notebook. The specific data used will not be provided, as it is owned by forecast.io. You can download it yourself!

Citations:
NASA. "The Consequences of Climate Change." http://climate.nasa.gov/effects/. NASA, n.d. Web. 10 Apr. 2016.

National Geographic News. "Global Warming Fast Facts." National Geographic Society, 14 June 2007. Web. 10 Apr. 2016.

WWF. "Rising Temperatures." http://wwf.panda.org/about_our_earth/aboutcc/problems/rising_temperatures/. n.d. Web. 10 Apr. 2016.



Monday, March 28, 2016

Finding Out All of your NAN and NAT rows in your Data Frame At Once

Coming into DataFrames, one of my biggest pain points was finding out where the NAN (Not A Number) or NAT (Not A Time) values are in my plethora of data. I prefer not to drop these rows or ignore them during crucial calculations.

Sometimes I want to find out exactly where these values are so I can replace them with more sensible numbers, such as averages in time-series data.


Pandas is such a powerful library that you can build an index from your DataFrame to figure out which rows hold the NAN/NAT values.

You can skip all the way to the bottom to see the code snippet, or read along to see how these Pandas methods work together.


Let's say I have a DataFrame that I suspect has missing values.


This is actual data from my weather data set, and I was expecting a count of 365 for the year 2015. The .count() method is great for detecting missing values because, by default, it doesn't include NAN or NAT values in the count.

Now to the meat. pd.isnull() checks cell by cell whether each value is null and returns a boolean DataFrame.

That's not useful yet, because we don't have time to check through the True cases for every row.

Next is the pd.DataFrame.any() method. It checks, along the specified axis (column-wise or row-wise), whether there is at least one True value.



We're almost there. The above tells me which rows have at least one null value. Now I want to see only the indexes that hold the null values.


To see if it's working....
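Since the original screenshot isn't reproduced here, here's a small self-contained sketch of the whole chain on the same toy frame used in the snippet at the bottom:

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, np.nan, 3, 4], 'col2': ['bam', 'boo', 'foo', 'john', pd.NaT]})
null_rows = pd.isnull(df).any(axis=1)   # boolean Series: True where a row has at least one NaN/NaT
print(df.index[null_rows])              # just the index labels of those rows
print(df[null_rows])                    # or the offending rows themselves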



Now that you know where the NAN/NAT values are in your DataFrame, what will you do?
If you're too darn lazy, maybe you'll just use DataFrame.interpolate(), which fills in each missing value based on the values in the previous and next rows.
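As a toy illustration of what interpolate() does (a made-up Series, not my weather data):

import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, np.nan, 4.0])
print(s.interpolate())   # the NaN becomes 3.0, halfway between its neighbors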


Code Snippet below, enjoy!:


import numpy as np
import pandas as pd

df = pd.DataFrame({'col1':[1,2,np.nan,3,4],'col2':['bam','boo','foo','john',pd.NaT]})  ##Create sample data with missing values
nan_dex = np.flatnonzero(pd.isnull(df).any(axis=1))  #Create index positions of rows with NaN/NaT
df.iloc[nan_dex]  #show me the nans

Sunday, March 27, 2016

Converting All Unix Time Stamp to DateTime in a DataFrame in One Run!

While data munging for my Weather Data Analysis blog, I skipped over a bunch of other valuable data, all of which needed converting to datetime. I came across a solution today while playing with Pandas' .apply() and .applymap() functions.

Here's what the raw data looked like prior to the human-readable datetime conversion.

To the non-developers: this number is also known as a Unix timestamp. Well, what is Unix? It's an operating system, very similar to what's running underneath a Mac. In many systems, the Unix timestamp is the way to go for distributing date data, because it's an inexpensive way to interpret dates and a safe way to pass them around. There are a lot of ways to represent a date, such as:

  • 2013-12-31T 15:25:46Z
  • 12/31/2013 3:25.46.00
  • December 31, 2013 3:25.46pm
The ones I've shown above are just off the top of my head. Having worked with systems for a while, I know that code that doesn't parse dates well can easily be tripped up by unexpected inputs. Furthermore, which region is this date time from? PST? GMT? EST? The Unix timestamp solves all of that because it is simply the number of seconds elapsed since the epoch, January 1st, 1970 UTC. More is explained on the Wiki.
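A quick one-liner shows what the conversion looks like (the timestamp below is just a sample value):

import pandas as pd

ts = 1388534400                      # a 10-digit Unix timestamp, i.e. seconds since 1970-01-01 UTC
print(pd.to_datetime(ts, unit='s'))  # -> 2014-01-01 00:00:00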


Now that I have this data set, there are only a couple of columns that I actually want to convert to datetime. I'm too lazy to convert one column at a time, so let's create functions that can be applied to all the columns.

In this tutorial, I'll show you how to create two functions that figure out which columns are Unix timestamps, and then use .applymap() to apply the built-in pd.to_datetime() function to them.


Below is the first function, which figures out whether a given input is a number or a string.

def is_numeric(obj):
    # Numbers implement these arithmetic dunder methods; strings don't implement all of them.
    # (Note: '__div__' exists on Python 2 numbers; Python 3 uses '__truediv__' instead.)
    attrs = ['__add__', '__sub__', '__mul__', '__div__', '__pow__']
    return all(hasattr(obj, attr) for attr in attrs)

In Python, everything is an object. Numeric objects in particular should have the attributes for addition, subtraction, multiplication, division, and exponentiation. A way to see what attributes an object has is the dir() function.
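You can also check for those attributes directly with hasattr(); a quick sketch of what is_numeric() relies on (the '__div__' entry is a Python 2 detail):

attrs = ['__add__', '__sub__', '__mul__', '__div__', '__pow__']
print([a for a in attrs if hasattr(5, a)])       # a Python 2 int has all five
print([a for a in attrs if hasattr('abc', a)])   # a string only has '__add__' and '__mul__'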



In the screenshot I originally posted you could only see '__add__' and '__div__', but feel free to try the code above in Python and you should see the rest of the attributes mentioned in the is_numeric() function defined above.



For strings, there will be '__add__' and '__mul__', but that's it.


The next function figures out whether our data is a Unix timestamp or some other, irrelevant numeric data, such as the 'apparentTemperatureMax' column. The first thing I noticed is that a Unix timestamp has a lot more digits.

A timestamp of 10 digits carries information down to the second; 13 digits means milliseconds, and so on...

Combining all of the information above, below is the code. I'm going to use .applymap(), which applies the check and the pd.to_datetime() conversion to every cell in every column.




The .apply() function, by contrast, works on one column (a Series) at a time. Below is an example:
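Since the original screenshot isn't shown here, a sketch of the single-column route, assuming the same daily_df and its 'sunriseTime' timestamp column:

import pandas as pd

daily_df['sunriseTime'] = daily_df['sunriseTime'].apply(lambda ts: pd.to_datetime(ts, unit='s'))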


Code snippet below in case you want to copy and paste. Please give me credit if you do!
import pandas as pd

def is_numeric(obj):
    attrs = ['__add__', '__sub__', '__mul__', '__div__', '__pow__']
    return all(hasattr(obj, attr) for attr in attrs)

#Example of applymap
def convert_dt(x):
    if is_numeric(x) and len(str(x))==10: #numeric and 10 digits long, so we only convert second-resolution timestamps
        return pd.to_datetime(x, unit='s')
    elif is_numeric(x) and len(str(x))==9: #9 digits is still seconds (pre-2001 dates); change the length and unit (e.g. 13 digits with unit='ms') as needed
        return pd.to_datetime(x, unit='s')
    else:
        return x #leave everything else untouched
daily_df.applymap(convert_dt)  #daily_df is the weather DataFrame built in the earlier posts

Tuesday, March 1, 2016

Weather Data Analysis - Part I

Weather Data Analysis
So I was playing with IPython along with the holy trinity: matplotlib, pandas, and numpy. I needed some data to really learn something here. It was March 2016, and I was thinking, "Wow, it seems to be getting colder year after year in San Francisco! Sure don't feel the Global Warming yet!" I didn't have a whole year of data for 2016, so I went ahead and compared 2014 and 2015 temperatures for my Oakland zip code.
I came across a fairly good weather API online called http://forecast.io/. (IO is actually the top-level domain for the British Indian Ocean Territory, but it's a really cool concept for an API website.)
I had my data and did some analysis. A two-sample t-test shows that there wasn't a significant difference between the two years. Not a surprise from first-hand experience, as it's always been around the 40s to high 60s Fahrenheit. To really get a significant difference, the range of the weather would have to be wider, or there would need to be a big standard deviation from one month to another. Though my t-test failed, yielding a P-value of 0.577 against an alpha of 0.05, I was able to extrapolate from the graphs that between 2014 and 2015 our temperatures weren't warmer in general, but slightly more volatile. (Perhaps I should compare year 2015 to something like year 2000 - I'll save that for Part II.)
My conclusion from this quick analysis is that Global Warming is not making the Bay Area warmer, but it may be making the weather more volatile - this is suggested by the box plot, density plot, and scatter plot below.
If you're using any of the code snippets, please do make a reference to this page!
In [1]:
#import forecastio 
import arrow #A much mooooooore friendlier version than datetime package. I found my new datetime love!
import json 
import datetime as dt
from progressbar import ProgressBar
import forecastio
import numpy as np

"""
Forecastio is a great website for grabbing weather data. For the free version, they're only allowing 1000 API calls per day. 
But for every call, you'll grab 1 day of comprehensive data from all the aggregated sources that they're using - which is a lot.
Check out their website!
https://developer.forecast.io/
"""




api_key = 'd2af52fa407f31e459ca628c71c9edaeaa'
lati = '37.817773'
long = '-122.272232'

start = dt.datetime(2006,1,1)
end = dt.datetime(2008,1,1)

f_name="{0}--{1}-at-SanFran".format(end.date(),start.date(),lati=lati,long=long)

"""
pbar=ProgressBar()
with open(f_name+'.json', 'a+') as file:
    for c_date in pbar(arrow.Arrow.range('day', start, end)):
        fore = forecastio.load_forecast(api_key, lati, long,time=c_date)
        data=fore.json
        json.dump(data, file)
        file.write('\n')
        print c_date
"""
In [2]:
import pandas as pd 


'''
print pd.DatetimeIndex(pd.to_datetime(csv['sunriseTime'],unit='s',utc=True),tz=timezone('UTC')) #--timezone output are all in UTC, need to convert later
#Probably would need to do something like the following
import datetime as dt 
import pytz
time_now =  dt.datetime.utcnow()  #if inputting POSIX unix epoch time, use utcfromtimestamp() function 
pacific_time=pytz.timezone('America/Los_Angeles')
with_utc = time_now.replace(tzinfo=pytz.utc) 
with_pacific = with_utc.astimezone(pacific_time)
'''
daily_df=pd.DataFrame()
with open('weather_2014-01-01---2016-01-01.json','rb') as f:
    lines=f.readlines()
    for line in lines:
        try:
            json = pd.read_json(line)['daily']['data'][0]
            #You can get the column names by json.keys()
            df=pd.DataFrame(json,index=(pd.to_datetime(json['sunriseTime'],unit='s'),),
                            columns=[u'summary', u'sunriseTime', u'apparentTemperatureMinTime', u'moonPhase', u'icon', u'precipType', 
                                   u'apparentTemperatureMax', u'temperatureMax', u'time', u'apparentTemperatureMaxTime', u'sunsetTime', 
                                   u'pressure', u'windSpeed', u'temperatureMin', u'apparentTemperatureMin', u'windBearing', u'temperatureMaxTime', 
                                   u'temperatureMinTime'])
            daily_df=daily_df.append(df)
        except KeyError: 
            pass
            #Finished Appending

daily_df['averageTemperature']=(daily_df['temperatureMax']+daily_df['temperatureMin'])/2  #Creating average Temperature dynamically
        
In [3]:
#Now finally for some real fun
df_2015=daily_df['2015']['averageTemperature'].copy()
df_2014=daily_df['2014']['averageTemperature'].copy()

"""
#Notice I'm doing something extremely interesting here. 
The problem is that I'm trying to compare year 2014 and year 2015 with the same day on the same graph, but datetime don't have 
this logic of removing the year from the date. So I created a new index starting year 2015-01-01 till the end and put them on the same index
Next, I tried to hide the year from the graph. A very Hacky way to do this, but works fine.
"""


new_index=pd.date_range('2015-01-01', periods=365, freq='D') 
new_df=pd.DataFrame({'Year 2014':df_2014.values, 'Year 2015':df_2015.values},index=new_index)


#Create a new column of Strings categorizing each row by months
def month_cat(num):
    num=int(num)
    if num==1: return 'Jan'
    if num==2: return 'Feb'
    if num==3: return 'Mar'
    if num==4: return 'Apr'
    if num==5: return 'May'
    if num==6: return 'Jun'
    if num==7: return 'Jul'
    if num==8: return 'Aug'
    if num==9: return 'Sep'
    if num==10: return 'Oct'
    if num==11: return 'Nov'
    if num==12: return 'Dec'

#new_df.drop('month_category', axis=1, inplace=True)
cat_list=list()
for i in new_df.index.month:
    cat_list.append(month_cat(i))
new_df['month_category']=cat_list



import datetime
import matplotlib.pyplot as plt

%matplotlib inline
plt.style.use('ggplot')

from IPython.core.display import HTML
HTML("<style>.container { width:100% !important; }</style>") #Makes the graphs fitting window size of the browser. 
Out[3]:
In [4]:
print "Time to look at the basic statistics of our data"
print new_df[['Year 2014', 'Year 2015']].describe()

print "-"*100
print "I was expecting both to have count as 365. Looks like might have some NAN values. The next thing I'm going to try to do is fill each NAN with the average temperature it is grouped in per month"

nan_index= pd.isnull(new_df[['Year 2015']]).any(1).nonzero()[0]
#print nan_index #[ 50  51 156 239 240 308 310 337 338]
print new_df.iloc[nan_index]

print "-"*100
print "I might find a more pythonic way to do this in the future, but right now I'm jus going to calculate every month's average and fill the NAN row in the associated month."
print "-"*100
print "Before conversion"
print new_df['Year 2015'].iloc[nan_index]

month_avg_2015={}
for month in new_df.groupby('month_category').groups.keys():
    month_avg_2015[month]=new_df.groupby('month_category')['Year 2015'].get_group(month).mean()

    
for nans in nan_index:
    new_df['Year 2015'].iloc[nans]=month_avg_2015[new_df['month_category'].iloc[nans]]

print "-"*100
print "After conversion"
print new_df['Year 2015'].iloc[nan_index]
Time to look at the basic statistics of our data
        Year 2014  Year 2015
count     365.000        356
unique    337.000        335
top        59.705         56
freq        3.000          2
----------------------------------------------------------------------------------------------------
I was expecting both to have count as 365. Looks like might have some NAN values. The next thing I'm going to try to do is fill each NAN with the average temperature it is grouped in per month
           Year 2014 Year 2015 month_category
2015-02-20     56.56       NaN            Feb
2015-02-21     57.13       NaN            Feb
2015-06-06    57.755       NaN            Jun
2015-08-28    64.115       NaN            Aug
2015-08-29    62.925       NaN            Aug
2015-11-05    64.965       NaN            Nov
2015-11-07     61.01       NaN            Nov
2015-12-04     60.72       NaN            Dec
2015-12-05    60.555       NaN            Dec
----------------------------------------------------------------------------------------------------
I might find a more pythonic way to do this in the future, but right now I'm jus going to calculate every month's average and fill the NAN row in the associated month.
----------------------------------------------------------------------------------------------------
Before conversion
2015-02-20    NaN
2015-02-21    NaN
2015-06-06    NaN
2015-08-28    NaN
2015-08-29    NaN
2015-11-05    NaN
2015-11-07    NaN
2015-12-04    NaN
2015-12-05    NaN
Name: Year 2015, dtype: object
----------------------------------------------------------------------------------------------------
After conversion
2015-02-20    58.53519
2015-02-21    58.53519
2015-06-06    60.96034
2015-08-28    66.01207
2015-08-29    66.01207
2015-11-05    56.00054
2015-11-07    56.00054
2015-12-04      51.275
2015-12-05      51.275
Name: Year 2015, dtype: object
In [5]:
print new_df.describe() #Data looks a lot better now. Finally going into some plotting. 


line_plot = new_df.plot(kind='line',figsize =(30,10),title="Year 2014 and 2015 Temperatures",lw=2,fontsize=15)
line_plot.set_xlabel("Months(Disregard Jan 2015, don't know how to get rid of it)")
line_plot.set_ylabel("Farenheit")
"""
Can't tell much different from the simple line plot. 
"""
        Year 2014  Year 2015 month_category
count     365.000        365            365
unique    337.000        340             12
top        59.705         56            Jul
freq        3.000          2             31
Out[5]:
"\nCan't tell much different from the simple line plot. \n"
In [6]:
line_plot = new_df.plot(kind='density',figsize =(30,10),title="Year 2014 and 2015 Temperatures", lw=3.0, fontsize=20)
"""
Density plot shows in Year 2014, the temperature seemed a bit more mild and stable in comparison to Year 2015 which had a wider spread. 
"""
Out[6]:
'\nDensity plot shows in Year 2014, the temperature seemed a bit more mild and stable in comparison to Year 2015 which had a wider spread. \n'
In [7]:
new_df.plot(kind='box',figsize =(8,10),title="Year 2014 and 2015 Temperatures")
'''
The Box plot also confirms this. 
'''
Out[7]:
'\nThe Box plot also confirms this. \n'
In [8]:
grouped = new_df.groupby('month_category')
plotting = grouped.plot()

#axis=0 along the row
#axis=1 along the column
#http://stackoverflow.com/questions/26886653/pandas-create-new-column-based-on-values-from-other-columns quite useful save for later 
In [9]:
print new_df.plot(kind='scatter', x='Year 2014', y='Year 2015', figsize=(10,10)) #Scatter plot


'''
Scatter plot looks like there is a moderate positive relationship between the two years. There's a lot of clusters in the middle whereas higher temperatures had more extreme outliers. 
'''
Axes(0.125,0.125;0.775x0.775)
Out[9]:
"\nScatter plot looks like there is a moderate positive relationship between the two years. There's a lot of clusters in the middle whereas higher temperatures had more extreme outliers. \n"
In [10]:
from scipy import stats

t_stat, two_tail_p_value=stats.ttest_ind(new_df['Year 2014'],new_df['Year 2015'],equal_var=False)  #Conducts Welch's Test. Note that Pvalue response from stats.ttest_ind default is two tail. 
print "T-Statistic is: {0} and P-Value is: {1}".format(t_stat,two_tail_p_value)

"""
With my alpha of 0.05, and I received a P-Value of 0.577, I have failed to prove that there is a significant difference of mean between Year 2014 Temperature and Year 2015. 
Though from the graphs, we can say that Year 2014 seemed to have smaller spread of temperature - a much more stable year than 2015. 
"""
T-Statistic is: 0.557834629664 and P-Value is: 0.577133796932
Out[10]:
'\nWith my alpha of 0.05, and I received a P-Value of 0.577, I have failed to prove that there is a significant difference of mean between Year 2014 Temperature and Year 2015. \nThough from the graphs, we can say that Year 2014 seemed to have smaller spread of temperature - a much more stable year than 2015. \n'
