1 Key Takeaways

We analyzed cuts at the energy plants in Turkey and here is our results so far.

2 Overview and Preparation

We obtained energy plant cuts data in Turkey between 2012 and 2018 from the Energy Transparency Platform. Raw data contains 73311 rows and 7 variables.

Objectives of this project is as follows:

Before we start here are our required libraries;

knitr::opts_chunk$set(echo = TRUE,warning=FALSE)
library(tidyverse)
library(tidytext)
library(readxl)
library(forcats)
library(rlang)
library(scales)
library(knitr)
library(lubridate)
library(RColorBrewer)
library(plotly)
library(treemapify)
library(stringr)
library(data.table)
library(prettydoc)
library(kableExtra)
library(splitstackshape)
library(grid)
library(tm)
library(wordcloud)
library(gridExtra)
library(dplyr)
# Create a temporary file
tmp<-tempfile(fileext=".csv")
# Download file from repository to the temp file
download.file("https://github.com/MEF-BDA503/gpj18-first/blob/master/dataset_candidates/ArizaBakim-01012008-25112018.csv?raw=true",destfile=tmp)
# Reading data and setting colnames properly due to problems to knit markdown document and github pages with Turkish characters.
cuts<-read.csv(tmp, encoding= "Latin-1", sep = ";")
colnames(cuts) <-c("Plant.Name","UEVCB","Start.Date","End.Date","Established.Power","Power.atOutage","Reason")
# Checking what we have as data.
str(cuts)
## 'data.frame':    74036 obs. of  7 variables:
##  $ Plant.Name       : Factor w/ 891 levels "","    ","  2.Ünite kazan siklonunda kısmi curuflanma ve vortex silindirinde kısmi yamulmadan dolayı ünite servis harici edildi. ",..: 733 889 345 222 66 508 401 572 730 485 ...
##  $ UEVCB            : Factor w/ 748 levels ""," AES ENTEK ELEKTRİK ÜRETİMİ A.Ş. (KOCAELİ)",..: 635 744 297 186 14 411 342 473 625 389 ...
##  $ Start.Date       : Factor w/ 62953 levels "","1.1.2013 00:00",..: 16880 13994 4693 17488 39657 24102 24857 28100 46422 60030 ...
##  $ End.Date         : Factor w/ 42169 levels "","1.1.2013 16:59",..: 11246 9366 3031 11662 26567 16055 16567 18722 31153 40256 ...
##  $ Established.Power: Factor w/ 595 levels "","0","0,06",..: 256 443 120 224 480 489 531 336 379 420 ...
##  $ Power.atOutage   : Factor w/ 837 levels "","-100","-110",..: 107 9 9 9 9 439 308 9 123 9 ...
##  $ Reason           : Factor w/ 31166 levels "","'F'DEĞİRMENİ ARIZASI",..: 4148 27233 12956 24146 15255 1195 11705 11670 4519 7164 ...
# Change type of time and date
cuts$Start.Date <- dmy_hm(cuts$Start.Date)
cuts$End.Date<- dmy_hm(cuts$End.Date)
# Turn power capacity variables into numeric
cuts$Established.Power <- str_replace(cuts$Established.Power,"[:punct:]","") 
cuts$Power.atOutage <- str_replace(cuts$Power.atOutage,"[:punct:]","") 
cuts$Established.Power <- str_replace(cuts$Established.Power,",","") 
cuts$Power.atOutage <- str_replace(cuts$Power.atOutage,",","")
cuts$Established.Power <- as.factor(cuts$Established.Power)
cuts$Power.atOutage <- as.factor(cuts$Power.atOutage)
cuts$Established.Power <-as.numeric(levels(cuts$Established.Power))[cuts$Established.Power]
cuts$Power.atOutage <-as.numeric(levels(cuts$Power.atOutage))[cuts$Power.atOutage]
cuts[c("Established.Power","Power.atOutage")]<-cuts[c("Established.Power","Power.atOutage")]/100
# Removing data that contains NA's.
cuts <- na.omit(cuts)
#Drop UEVÇB column, we do not think it will be any help for analysis.
cuts <- subset(cuts, select=-UEVCB)
# Let's check the data for yearly total cuts.
yearly_cuts <- cuts %>% group_by(year=floor_date(Start.Date,"year")) %>% summarize(Start.Date=n())
yearly_cuts
## # A tibble: 7 x 2
##   year                Start.Date
##   <dttm>                   <int>
## 1 2012-01-01 00:00:00        338
## 2 2013-01-01 00:00:00       4791
## 3 2014-01-01 00:00:00      10650
## 4 2015-01-01 00:00:00      12684
## 5 2016-01-01 00:00:00      12763
## 6 2017-01-01 00:00:00      10880
## 7 2018-01-01 00:00:00      21207

Look like there is an average of 10-12k cuts per year in the database with the exception of 2018, where the number of cuts seems to be much higher. We believe this is simply the result of transparency platform adding more data from plants or more plants started to use this database in 2018.

#Using function provided by Berk Orbay
computer_friendlying<-function(mytext,spaceas="-"){
  mytext<-gsub(" ",spaceas,mytext)
  mytext<-gsub("ç","c",mytext)
  mytext<-gsub("ş","s",mytext)
  mytext<-gsub("ğ","g",mytext)
  mytext<-gsub("ü","u",mytext)
  mytext<-gsub("ö","o",mytext)
  mytext<-gsub("ı","i",mytext)
  mytext<-gsub("Ç","C",mytext)
  mytext<-gsub("Ş","S",mytext)
  mytext<-gsub("Ğ","G",mytext)
  mytext<-gsub("Ü","U",mytext)
  mytext<-gsub("Ö","O",mytext)
  mytext<-gsub("İ","I",mytext)
  return(mytext)
}
cuts$Reason <- computer_friendlying(cuts$Reason);
# Let's put the yearly cuts into a bar graph on raw data before we move to data cleaning.
ggplot(data=yearly_cuts, aes(y=Start.Date, x=factor(year(year)), fill=factor(year(year))))+
  geom_bar(stat="identity")+
  labs(x="Year", y="Incident Count", title="Yearly Total Incidents")+
  theme_light()+
  scale_fill_brewer(palette="PuBuGn")+
  theme(legend.position="none")

3 Data Preprocessing and Wrangling

The datase has mostly text based data. Initial overview of the database shows a lot of errors, some of which come from localization problems, others simply misspelled. In this part, we’ve attempted to make it more tidy by making plant names more distinct, cut reasons more clear.

#Adding duration variable as duration of the cuts as hours and capacity usage ratio as capacity usage at the time of the cut's ratio to the total capacity.
cuts <- cuts %>% mutate(Duration = difftime(End.Date,Start.Date,units="hours")) %>% mutate(Capacityratio = Power.atOutage / Established.Power)
#Rounding capacity ratio to two decimals.
cuts$Capacityratio <- round(cuts$Capacityratio,2)
# Make all strings lower case so that can be cleaned easier.
cuts <- cuts %>% mutate_at(.vars=c("Plant.Name", "Reason"), funs(str_to_lower(.,locale="tr")))
#With new variables let's see our data.
kable(cuts[1:5,]) %>%
  kable_styling(bootstrap_options = "striped", full_width = F)
Plant.Name Start.Date End.Date Established.Power Power.atOutage Reason Duration Capacityratio
silopi tes 2016-10-17 08:10:00 2016-10-17 10:59:00 2.70 1.2 3.unite-kazan-boru-patlagindan-dolayi-servis-harici 2.816667 hours 0.44
zorlu enerji lüleburgaz santralı 2016-06-15 15:56:00 2016-06-15 15:59:00 57.96 0.0 turbın-stoll-detected-trıp 0.050000 hours 0.00
enerjisa - hacınınoğlu 2018-10-11 00:22:00 2018-10-11 05:59:00 142.28 0.0 gol-seviyesi- 5.616667 hours 0.00
cengiz 240mw samsun gaz yakıtlı kombine çevrim enerji santrali 2016-03-17 19:01:00 2016-03-17 22:59:00 23.89 0.0 sogutma-suyu-arizasi 3.966667 hours 0.00
acarsoy denizli doğalgaz santrali 2014-11-27 08:44:00 2014-11-27 23:59:00 0.63 0.0 gt-senkron-problemi 15.250000 hours 0.00
#Detect similiar names with Fuzzy Text matching. This gives us possible pairs to be matched so that we can use str_replace_all to clean up data.
uniqueNames <- unique(cuts$Plant.Name)
name_distances <- list()
i <- 1
for (ind in uniqueNames){
  name_distances[[i]] <- agrep(ind, uniqueNames, value=T)
  i <- i+1
}
name_distances <- unique(Filter(function(x) {length(x) > 1}, name_distances))
#Fixing detected string issues.
cuts$Plant.Name <- cuts$Plant.Name %>% 
  str_replace_all("[Yy]", "i") %>%
  str_replace_all("enerj.sa", "enerjisa") %>%
  str_replace_all("yeniköy ts", "yeniköy tes") %>%
  str_replace_all("ieniköi tes", "yeniköy tes") %>%
  str_replace_all("^ova elektrik", "gebze ova elektrik") %>%
  str_replace_all("yata??an .*", "yata??an tes") %>%
  str_replace_all("köklüce$", "köklüce hes") %>%
  str_replace_all(".* entek", "entek") %>%
  str_replace_all("kürtün-hes", "kürtün hes") %>%
  str_replace_all("^rwe_turcas_guney", "denizli rwe_turcas_guney") %>%
  str_replace_all("tekirda?? santrali.*", "modern enerji tekirda?? santrali") %>%
  str_replace_all("karada??$", "karada?? res") %>%
  str_replace_all(".?menzelet( hes)?", "menzelet hes") %>%
  str_replace_all("\\.", "") %>%
  str_replace_all("hidro(\\s?elektrik santral[yi]| e\\.?s)", " hes") %>%
  str_replace_all("(termik santral[yi]|\\sts\\s?)", " tes") %>%
  str_replace_all("tunçbilektes", "tunçbilek tes") %>%
  str_replace_all("d.*(k.*)?ç.*(s.*)?", "DGKC") %>%
  str_replace_all("jeotermal (e.*s.*)", "jes")
  
cuts$Reason <- cuts$Reason %>%
  str_replace_all("[Yy]", "i") %>%
  str_replace_all("t(.r|r.)b.n", "türbin") %>%
  str_replace_all("ar.zas.?.?", "ariza") %>%
  str_replace_all("so.utma", "sogutma") %>%
    str_replace_all("dolaiı", "dolayı") %>%
  str_replace_all("(?<!\\d)\\.", "") %>%
  str_replace_all(".n.te", "unite") %>%
  str_replace_all("suiu", "suyu") %>%
   str_replace_all("ar[ı?]za", "ariza") %>%
  str_replace_all("reg.lat.r", "regulator")

Based on the plant name, we created a new column called Plant.Type which stores the type of the plant.

The abbreviations are as following: * HES : Hydroelectricty Plant * RES : Wind Energy Plant * TES : Thermal Power Plant * DGKC: Natural Gas Combined Cycle Power Plant * JES : Geothermal Energy Plant * BES : Biomass Energy Plant/Biogas Plant

#Categorising type of plants so we can do type based analysis later on.
cuts<- cuts %>%
  mutate(Plant.Type=ifelse(grepl("hes", cuts$Plant.Name, ignore.case = T), "HES", 
         (ifelse(grepl(" res\\s?|rüzgar", cuts$Plant.Name, ignore.case = T), "RES",
         (ifelse(grepl("( tes\\s?|termik santral|ithal kömür|bolu göynük|eski??ehir endüstriyel|alia??a çakmaktepe|enerjisa tufanbeyli|ataer)", cuts$Plant.Name, ignore.case = T), "TES",
         (ifelse(grepl("(d.*k.*ç.*(s.*)?|bosen|acarsoy denizli|akenerji|ambarli|m.*osb|kombine|paner|enerjisa (bandirma|kentsa)|kojsant|zorlu enerji|rwe_turcas|kojen|ugur enerji|isbirligi-enerji|gebze ova elektrik)", cuts$Plant.Name, ignore.case = T), "DGKC",
         (ifelse(grepl("(jes|jeotermal)", cuts$Plant.Name, ignore.case =T), "JES",
         (ifelse(grepl("biyokütle|biogaz", cuts$Plant.Name, ignore.case =T), "BES",       
                "Other"))))))))))))
cuts$Plant.Type <- as.factor(cuts$Plant.Type)

In order to categorise the reasons, we performed a word count for reason column, laying out most encountered words.

gerekcewordcount <- cSplit(cuts, "Reason", sep = " ", direction = "long") %>%
      group_by(Reason) %>%
      dplyr::summarise(Count = n())
arrange(gerekcewordcount,desc(Count))
## # A tibble: 26,159 x 2
##    Reason                  Count
##    <fct>                   <int>
##  1 türbin-ariza             4503
##  2 ariza                    2734
##  3 kazan-boru-patlagi        617
##  4 santral-ariza             588
##  5 sogutma-suyu-ariza        477
##  6 rms-gaz-regulator-ariza   374
##  7 gt-ariza                  351
##  8 degirmen-ariza            289
##  9 sogutma-sistemi-ariza     289
## 10 kompresor-ariza           283
## # ... with 26,149 more rows
gerekcewordcounttop20 <- gerekcewordcount %>% top_n(n=20)
## Selecting by Count
ggplot(gerekcewordcounttop20,aes(reorder(Reason,Count),Count,fill="red"))+
  geom_bar(stat="identity")+
  coord_flip()+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),legend.position  = "none")

This could be advanced further via removing stop words(ex:ve) and numbers.

Let’s not stop here and look for which words follow each other using n-gram analysis. We will take bi-gram anaylsis

gerekce_bigram <- cuts %>% unnest_tokens(bigram, Reason, token = "ngrams", n = 2)
#Here we see what words follow the other one according to plant type.
bigram_count <- gerekce_bigram %>% group_by(Plant.Type) %>% count(bigram,sort=TRUE) %>% na.omit()
bigram_count
## # A tibble: 50,707 x 3
## # Groups:   Plant.Type [6]
##    Plant.Type bigram            n
##    <fct>      <chr>         <int>
##  1 RES        türbin ariza   3648
##  2 TES        kazan boru     1439
##  3 TES        boru patlagi   1347
##  4 TES        devam ediior   1019
##  5 Other      sogutma suyu   1018
##  6 TES        devreie alma    983
##  7 HES        sogutma suyu    954
##  8 HES        unite 1         836
##  9 HES        unite 2         761
## 10 Other      sistemi ariza   720
## # ... with 50,697 more rows
bigram_counttop20 <- bigram_count %>% top_n(n=20)
## Selecting by n
ggplot(bigram_counttop20,aes(reorder(bigram,n),n,fill=n))+
  geom_bar(stat="identity")+
  coord_flip()+
  facet_wrap(~Plant.Type,scales="free")+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank(),legend.position  = "none")

Furthermore let’s visualize most frequent words in cut reason with a wordcloud.

#Inspired from https://georeferenced.wordpress.com/2013/01/15/rwordcloud/
cutsReason.Corpus<-Corpus(VectorSource(cuts$Reason))
cutsReason.Corpus<-tm_map(cutsReason.Corpus, PlainTextDocument)
cutsReason.Corpus<-tm_map(cutsReason.Corpus,tolower)
wordcloud(cutsReason.Corpus,min.freq = 5,
          max.words=100, random.order=FALSE, rot.per=0.15, 
          colors=brewer.pal(8, "PuOr"),scale=c(4,0.8))

There are two types of cuts at the overall level. Either due to a malfunction or due to a planned activity such as turnaround maintenances, tests, capacity reductions due to economic reasons or supply demand balance etc. Let’s separate them according to type of cut. We searched for words that imply an planned activity at the reason of cut and assigned “Planned Activity” as variable to a new column. All others are assigned to “Malfunction”.

cuts <- cuts%>% mutate(TypeofCut=ifelse(grepl("(bak.m|[cç]al[yi][s??]ma|devreye alma|yük alma|test|planl[yi]|devre di[s??]i)", Reason, ignore.case = T, perl=T), "Planned Activity", "Malfunction"))

4 Data Analysis

4.1 Initial Exploration

Let’s take a look at our final dataset.

str(cuts)
## 'data.frame':    73313 obs. of  10 variables:
##  $ Plant.Name       : chr  "silopi tes" "zorlu enerji lüleburgaz santralı" "enerjisa - hacınınoğlu" "cengiz 240mw samsun gaz iakıtlı kombine çevrim enerji santrali" ...
##  $ Start.Date       : POSIXct, format: "2016-10-17 08:10:00" "2016-06-15 15:56:00" ...
##  $ End.Date         : POSIXct, format: "2016-10-17 10:59:00" "2016-06-15 15:59:00" ...
##  $ Established.Power: num  2.7 57.96 142.28 23.89 0.63 ...
##  $ Power.atOutage   : num  1.2 0 0 0 0 4.2 3 0 1.35 0 ...
##  $ Reason           : chr  "3.unite-kazan-boru-patlagindan-dolaii-servis-harici" "türbin-stoll-detected-trıp" "gol-seviiesi-" "sogutma-suyu-ariza" ...
##  $ Duration         : 'difftime' num  2.81666666666667 0.05 5.61666666666667 3.96666666666667 ...
##   ..- attr(*, "units")= chr "hours"
##  $ Capacityratio    : num  0.44 0 0 0 0 0.64 0.04 0 0.3 0 ...
##  $ Plant.Type       : Factor w/ 6 levels "DGKC","HES","JES",..: 6 1 4 1 4 6 4 1 6 4 ...
##  $ TypeofCut        : chr  "Malfunction" "Malfunction" "Malfunction" "Malfunction" ...

Here are the explanations for variables:

  • Plant.Name : Name of the power plant.
  • Start.Date : Start date time of the cut.
  • End.Date : End date/time of the cut.
  • Established.Power : Total power at the plant.
  • Power.atOutage : Capacity at the time of incident
  • Reason : Reason of the cut
  • Duration : Length of the cut in hours.
  • Capacityratio : Proportion of the capacity at the time of the cut to max capacity.
  • Plant.Type : Type of the plant

Below we see capacity of the plants based on types. Labels are total power output of the plants in Turkey while bars reprenset one plants average output.

cuts %>%
  select(Plant.Type, Plant.Name, Established.Power) %>%
  distinct(Plant.Name, Plant.Type, Established.Power) %>%
  group_by(Plant.Type) %>%
  summarize(Mean=mean(Established.Power), Total=sum(Established.Power)) %>%
    ggplot(.)+
    geom_bar(aes(x=reorder(Plant.Type, -Mean), y=Mean, fill=Plant.Type), stat="identity")+
    geom_text(aes(x=Plant.Type, y=Total/100, label=signif(Total, 2)))+
    labs(x="", y="Average Power Output MWe", title="Power Output Based on Plant Type in Turkey", x="Plant Type")+
    theme_light()+
    scale_fill_brewer(palette="Greens")+
    theme(legend.position="none")+
    scale_y_continuous(sec.axis=sec_axis(~.*100, name="Total Power Output MWe"))

4.2 Analysis on Malfunctions

Our histogram that shows distribution of the cuts durations looks messy due to outliers. Here is a quick workaround by fixing outliers to a maximum value inspired by https://edwinth.github.io/blog/outlier-bin/. Same person has a nice package to deal with such situations, however its dependencies does not work with Shiny.

It looks like number of cuts are mostly between 0-2 hours with some some values up to 24 hours. Planned activities are much less than malfunctions however, their distribution look similar.

When we look at the capacity usage ratio at the malfunctions, there is a stack between 0-10%. However except that distribution is close to normal. On planned activity, there are some cuts that are planned and implemented while plant is working which I found interesting.

cuts %>% mutate(Duration_outlierfixed = ifelse(Duration > 24, 24, Duration))%>%
ggplot(aes(x=Duration_outlierfixed))+
  geom_histogram(bins=50)+
  facet_wrap(~TypeofCut)+
    theme_bw()+
  scale_x_continuous(limits=c(0,24))+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  xlab("Duration with Fixed Outlier")+
  ggtitle("Distribution of Cut Duration In Terms of Hours")

ggplot(cuts,aes(x=Capacityratio))+
  geom_histogram(bins=50)+
  facet_wrap(~TypeofCut)+
  scale_x_continuous(labels = percent,limits = c(0,1))+
  scale_y_continuous(limits=c(0,3000))+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  xlab("Capacity Ratio")+
  ggtitle("Distribution of Capacity Ratio at the Time of Cut")

Another thing to see is at which type of the plants cuts take longer time to resolve. Let’s see it for both cut types.

#Since we have one data point less than 0, we added Duration is bigger than 0 filter to exclude this.
ggplotly <- cuts %>% mutate(Duration_outlierfixed = ifelse(Duration > 24, 24, Duration))%>% filter(Duration > 0) %>%ggplot(aes(fill=Plant.Type))+
  geom_boxplot(aes(Plant.Type,Duration_outlierfixed))+
  theme_bw()+
  theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())+
  facet_grid(~TypeofCut)
ggplotly(ggplotly)

Result of this plot is interesting. First it shows that cuts related to malfunctions distributed completely differently than planned cuts. It probably means that planned activities are cuts that require longer time while malfunctions are usually smaller issues to solve.

Let’s see how many malfunctions the power plants have reported since late 2014.

#Gather plants that reported more than 1000 malfunctions in the last 6 years
m_count <- cuts %>%
  filter(TypeofCut=="Malfunction") %>%
  group_by(Plant.Name) %>%
  summarize(m_count=n()) %>%
  arrange(desc(m_count)) %>%
  filter(m_count >= 1000)
#Group them according to plant name and quarters.
m_plants <- as.vector(m_count$Plant.Name)
malf <- cuts %>%
  filter(Plant.Name %in% m_plants, TypeofCut=="Malfunction") %>%
  mutate(quarter=lubridate::quarter(Start.Date, with_year = T)) %>%
  group_by(Plant.Name, quarter) %>%
  summarize(malf=n())
#Visualize
malf$quarter=as.character(malf$quarter)    
ggplotly(
ggplot(malf, aes(x=quarter))+
  coord_flip()+
  theme_bw()+
  geom_bar(aes(y=malf, fill=Plant.Name), stat="identity")+
  theme(legend.position = "bottom", legend.title = element_text("Plant Name"))+
  labs(x="Quarter", y="Malfunction Count", title="Quarterly Fault Count of Top Frequently Malfunctioning Plants")
)

When we look at the graph, we can roughly see that thermal plants make up most of the top malfunctioning plants in Turkey. 2018’s 3rd quarter showed a spike in reported malfunctions, from our graph it seems like two major plants had numerous faults during this quarter, eren enerji and sebenoba reporting 502 and 482 cuts, respectively. A power plant is a huge facility that utilizes large numbers of various equipments. A fault may occur on any of them. So, to get a more explanatory outcome, we’ve decided to categorise these cuts by malfunction type based on the reason written.

#Categorise malfunctions
catmalf <- cuts %>%
  filter(TypeofCut=="Malfunction") %>%
  mutate(Malf.Category =ifelse(grepl("(t.rb[yi]n|g.?t.|kompr[ae]s[oö]r|pompa|fan|de[g??]irmen|vibrasyon|makin[ea]|trip|motor|govern[oöe]r|ate[s??\\?]leme|ayar kanat?|hidrolik start|c[uü]ruf)", Reason, ignore.case = T, perl=T), "Rotating Equipment Failure",
           (ifelse(grepl("(reg[uü]lat[oö]r|trafo|(elektrik|enerji) kesinti|so[g??]utma su.?.?|[gj]enerat[oö]r|elektriksel|154|bara|[s??]ebeke|santral)", Reason, ignore.case = T, perl=T), "Electrical or Other Utilities Failure",   
           (ifelse(grepl("((?<!\\w)su |k[oö\\?]m[üu\\?]r|gaz basin[çc]?)", Reason, ignore.case = T, perl=T), "Feedstock Issues",
           (ifelse(grepl("(vana|kazan|boru|hatt[yi]|e[??s]anjör|val(f|ve)|kablo|air preheater|ka.ak)", Reason, ignore.case = T, perl=T), "Static Equipment Failure", 
           (ifelse(grepl("(plc|dcs|haberle[??s]me|otom[oa]syon|scada|[ei]nstr[uü]?m.{2,3} (hava|air)|[\\?i]kaz)", Reason, ignore.case = T, perl=T), "Control and Automation Systems Failure",
           (ifelse(grepl("(bo[??g]ulma|çiftçi|ara[çc] dü[??s]|tarim|atmosfer(ik)?|ya[g??][yi][??s])", Reason, ignore.case = T, perl=T), "Outside Factors",
           (ifelse(grepl("(^(sistem )?ar.za(si|nin)?( devam.)?$|^$|[uü]nite ar[yi]za(s[iy])?)", Reason, ignore.case = T, perl=T), "Unspecified",
                "Other"))))))))))))))

So now there are several different malfunction types from rotating equipments such as pumps, turbines to static equipments like pipelines and heat exchangers. Electrical or other utility based cuts are numerous too, note that electrical failures here are problems in electricity that power plant equipments use, not the electricity they produce, therefore they are counted as utilities. There are a few strange cases too, a hydroelectricity power plant shutting down due to a car falling into the dam lake is one of such cases.

m_by_type<- catmalf %>% 
  group_by(Plant.Type, Malf.Category) %>%
  filter(Plant.Type!="Other")%>%
  count() %>%
  ungroup()%>%
  group_by(Plant.Type)%>%
  mutate(perc=`n`/sum(`n`))
#Plot pie charts for most occured malfunction type
plot_ly(textposition = 'inside',
        textinfo = 'label+percent',
        insidetextfont = list(color = '#FFFFFF'),
        marker = list(colors = colors,
                      line = list(color = '#FFFFFF', width = 1))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="TES"), labels = ~Malf.Category, values = ~n,
          name = "Thermal Energy Plant", domain = list(x = c(0, 0.35), y = c(0.50, 0.95))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="HES"), labels = ~Malf.Category, values = ~n,
          name = "Hydroelectricity Plant", domain = list(x = c(0.35, 1), y = c(0.50, 0.95))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="RES"), labels = ~Malf.Category, values = ~n,
          name = "Wind Energy Plant", domain = list(x = c(0, 0.35), y = c(0, 0.45))) %>%
  add_pie(data = subset(m_by_type, Plant.Type=="DGKC"), labels = ~Malf.Category, values = ~n,
          name = "Natural Gas CC Plant", domain = list(x = c(0.35, 1), y = c(0, 0.45))) %>%
  layout(title = "Malfunction Type by Plant", showlegend = F,
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = TRUE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         annotations = list(
      list(x = 0.09 , y = 1.0, text = "Thermal Energy Plant", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.8 , y = 1.0, text = "Hydroelectricity Plant", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.1 , y = 0.47, text = "Wind Turbine", showarrow = F, xref='paper', yref='paper'),
      list(x = 0.8 , y = 0.47, text = "Natural Gas CC Plant", showarrow = F, xref='paper', yref='paper')))

Thermal Plants have suffered from static equipment and rotating equipment failures almost in equal cases, hydroelectricty plants have had many electrical and utility issues. Being mostly comprised of rotating equipments, wind turbines’ majority of problems come from rotating equipments and among the reports of geothermal plants, almost half of them were electrical or other utility problems.

catmalf %>%
  filter(Capacityratio<=0.05 & Plant.Type %in% c("HES", "TES", "DGKC", "JES")) %>%
  select(Established.Power, Power.atOutage, Plant.Type, Malf.Category, Duration) %>%
  group_by(Plant.Type, Malf.Category) %>%
  summarise(count=n()) %>%
  mutate(perc=count/sum(count)) %>%
  filter(!Malf.Category %in% c("Outside Factors", "Other")) %>%
  ggplot(., aes(x=Malf.Category, y=perc, fill=Plant.Type))+
  geom_bar(stat="identity", position="dodge")+
  scale_y_continuous(limits=c(0,0.4), labels=percent)+
  theme_bw()+
  labs(x="Source of Shutdown", y="Percentage", title="Shutdown Causes")+
  scale_fill_brewer(palette="PuBuGn")+
  theme(legend.position = c(0.1,0.8), legend.title = element_text("Plant Type"))+
  scale_x_discrete(labels=c("Control and Automation", "Utilities", "Feedstock", "Rotating\nEquipment", "Static\nEquipment", "Unspecified"))

Wind turbines’ problems mostly came from rotating equipments, therefore there was no need putting in in our graph. Natural gas plants suffered shutdowns from rotating equipments most, most likely turbine related malfunctions. Curiously, thermal plant shutdowns usually came from static equipment, if we were to guess why, it’s probably because thermal plants work at much higher temperatures, the material lifecycle is shorter than other power plants and more prone to leaks and ruptures. Geothermal and hydroelectricty plants have suffered shutdowns from electrical and utility problem.

catmalf %>%
  filter(Capacityratio<=0.05, !Plant.Type=="Other") %>%
  group_by(Plant.Type) %>%
  summarise(Avg_sd=mean(Duration)) %>%
  ggplot(aes(x=Plant.Type))+
  geom_col(aes(y=Avg_sd, fill=Plant.Type))+
  scale_fill_brewer(palette="Purples")+
  theme_bw()+
  labs(y="Shutdown Duration (hours)", x="Plant Type", title="Average Shutdown Durations")

Thermal plants have about 12 hours of average shutdown duration. A shutdown is much more costly in hydroelectric plants and thermal plants.

5 Conclusion

Energy Transparency Platform’s cuts data is getting more thorough and widespread over time. There are much more data entries in 2017 and 2018 than previous years. This is promising because there is a huge data analysis potential in this field. However, the data in its current form is very cumbersome to deal with, here are some key problems;

After transforming our data to become somewhat analysable, we categorised plants and malfunction types to get some insight on what kind of problems power plants in Turkey were facing. Turkey’s biggest power sources are Thermal Plants and Hydroelectricity Plants, problems these type of plants were facing were more costly to us than others. The only source of information on the cuts were the in text form therefore we performed text mining to extract the categories, and also most encountered reasons. Here are a breakdown of our analysis results.

-Thermal Plants usually suffer from static or rotating equipment faults while hydroelectricity plants have a lot of problems regarding electrical equipment. -Regardless of plant, turbine malfunctions, as expected, forms the majority of plant cuts. -Thermal plants have reported more cuts than any other plants. -Every type of plant have a different leading cause for shutdowns. -Shutdowns last longer in HES’ and TES’.

6 References

https://georeferenced.wordpress.com/2013/01/15/rwordcloud/

https://plot.ly/r/bubble-charts/

https://edwinth.github.io/blog/outlier-bin/

https://www.tidytextmining.com/

https://stackoverflow.com

http://www.enerjiatlasi.com