A System for Detecting Refueling Behavior along Freight Trajectories and Recomme

时间:2022-09-23 10:45:43

Smart refueling can reduce costs and lower the possibility of an emergency. Refueling intelligence can only be obtained by mining historical refueling behaviors from big data; however, without devices, such as fuel tank cursors, and cooperation from drivers, these behaviors are hard to detect. Thus, detecting refueling behaviors from big data derived from easy-to-approach trajectories is one of the most efficient retrieve evidences for research of refueling behaviors. In this paper, we describe a complete procedure for detecting refueling behavior in big data derived from freight trajectories. This procedure involves the integration of spatial data mining and machine-learning techniques. The key part of the methodology is a pattern detector that extends the naive Bayes classifier. By drawing on the spatial and temporal characteristics of freight trajectories, refueling behaviors can be identified with high accuracy. Further, we present a refueling prediction and recommendation system to show how our refueling detector can be used practically in big data. Our experiments on real trajectories show that our refueling detector is accurate, and the system performs well.

spatial data mining; trajectory processing; big data

1 Introduction

esearch into smart refueling has become more important as drivers become more sensitive to fluctuating fuel prices. Previous research has been heavily focused on the use of theoretical models to analyze refueling behavior in small sets of experimental data [1]-[5]. However, intelligence on refueling behaviors can also be directly obtained from massive trajectories. Collective refueling intelligence is useful because refueling behaviors are summarized as a result of smart decisions by experienced freight drivers. Traffic systems using collective intelligence have performed well, both in scientific research and in real-world applications [6]. However, without additional devices and cooperation from drivers, refueling records are hard to collate and are often inaccurate. Traditional methods for monitoring refueling behaviors, such as floating cursors in a fuel tank, are not widely used because of the cost of equipment and maintenance. Thus, there are not huge volumes of refueling records, and we have to turn to trajectories to obtain data. Embedded GPS equipment has become common in the freight and logistics industries.

With elaborate algorithms, patterns can be recognized in trajectories; for example, we can identify whether a car is moving or not. If a car has stopped near a gas station, it is likely refueling. However, the subtle distinction between refueling and casual parking cannot be made.

1.1 Motivations

We have designed algorithms that can accurately identify refueling stops along trajectories. To benefit drivers, we have also built a refueling prediction and recommendation system based on collective intelligence. The main motivations of our research are no additional cost and powerful collective intelligence. We design the refueling-detection algorithms only by mining mass data from trajectories, and no additional devices are required. GPS is already installed in most cars; hence, our algorithms can be widely used without having to install, test, and maintain new devices. This saves costs. We were also inspired to design our data-mining algorithms partly because of successful use of collective intelligence in taxi services. Similarly, in the logistics and freight industries, refueling behaviors derived from collective intelligence can reveal industrial advantages and disadvantages.

1.2 Challenges and Contributions

Given enough spatial, temporal, and other information, a detector can follow simple rules, including rules about spatial constraints and constraints on traveled distance, to identify potential refueling stops. However, such methods rarely work well in practice because of a lack of valid prior knowledge. One crucial problem of all naive methods is the lack of reliable prior knowledge and training data set. To our knowledge, there are no publicly available records on refueling behavior, especially for the freight industry. Some data on refueling has been collected as part of research into fuel consumption and vehicle efficiency rather than refueling behavior [1]. Without customized sensors in the fuel tanks of cars, such information can only be obtained if drivers are willing to cooperate. This contributes to data inaccuracy and deficiency. Also, for rules-based decisions, prior knowledge extends further than simply routes and spatial distribution of gas stations. A solution therefore needs to be found for reliably collecting prior knowledge about refueling behavior. As well as valid prior knowledge, there are no training data sets because nothing can be directly observed from the spatial trajectories. Detectors cannot be optimized without training. Prior knowledge may offer the foundation for constructing a training data set.

The major drawback of naive methods is that they neglect the mutual influence of consecutive stops at nearby gas stations. For example, there may be two gas stations that are spatially and temporally close along a trajectory. The car has traveled such a distance from its last refueling stop that it is likely the driver will refuel at one of these gas stations. The detectors must classify one of these two eligible candidates as the refueling stop. Using a naive method, the first eligible candidate where the car firstly stopped by is classified as a refueling stop, and the result of this classification affects the other candidate. If the first stop near the first gas station is treated as the refueling stop, then the traveled distance from the last refueling stop is diminished for the next stop, and this makes the second candidate ineligible, even though the stop may be a possible alternative refueling stop [2]. The multiple hypothesis mechanism can handle this problem.

To overcome the above problems, we introduce several advanced data-mining and AI techniques, and we elaborate our algorithms. First, we propose a hill-climbing algorithm in local search and optimization. This algorithm is used to generate reliable prior knowledge about refueling behavior and construct the training data set. Then, we use a naive Bayes classifier algorithm to detect refueling stops with regard to multiple hypotheses. Last, we suggest that our detector has good extensibility by suggesting possible applications. Through abundant evaluations and data analysis, we show the robustness of our algorithms. The main contributions of this paper are

・a new method of inferring reliable prior knowledge about refueling behavior. This prior knowledge is derived from historical freight trajectories

・a method for determining freight refueling patterns based on our observations and processing of spatial trajectory data sets

・a solution to accurately detecting refueling stops along a single trajectory. This solution is based on naive Bayes.

・a real-time stream processing and batch-processing framework for processing and mining trajectory data

・a practical application for recommending refueling alternatives to drivers. This application is an extension of our detection algorithms.

The remainder of this paper is organized as follows: In section 2, we define terms used in this paper and outline problems. In section 3, we describe our methods. In section 4, we introduce a prediction and recommendation application based on our proposed algorithms. In section 5, we describe the design of our system. In section 6, we present our experimental results. In section 7, we discuss related work. Section 8 concludes the paper.

2 Definitions

The data set for freight trajectories is similar to that for non-freight trajectories [7]. However, the interests of freight and non-freight sectors are slightly different, and both look for different patterns. This means that different measurements are used, and there are differences between freight GPS records and non-freight GPS records.

Freight GPS record: A GPS record data reported by a freight car. In our research, a GPS record r i contains a time stamp t s , longitude lon, latitude lat, and odometer reading odm and is given by r i = {tsi , loni , lati , odmi }.

Single freight trip: An integrated, temporally continuous series of freight GPS records form a single freight trip, which is given by T = {r 1 , r 2 , ... , r i }. A freight trajectory must have a distinct origin and destination. The time stamp and odometer reading of the first freight GPS record are minimums. Typically, a trip starts after an unusually long period of parking and ends with another unusually long period of parking that is also considered the start of the next trip.

Gas station location: The gas station location is given by gsi = {gsid i , lon i , lat i , rfcount i }. The last attribute rfcount i is the number of refueling behaviors that occur at a gas station within a certain timeframe of the data set. The GPS coordinates of a gas station are an approximate location generated after map digitalization.

Vehicular stop: Vehicular stops emerge along the freight trajectories. A vehicular stop contains a set of continuous GPS records. A vehicular stop has the following attributes: locations, time stamp for when the stop begins, stopping interval, and minimum bound rectangle (MBR). This vehicular stop is given by VSi = {x i , y i , ts i , ti i , x , x , y , y }, where (X i , Yi ) is the expected coordinates of the stop, ts is when the stop begins, and tii is how long the stop lasts. The boundary of the MBR of GPS records for this stop is given by MBR = (x , x , y , y ). We divide the vehicular stops into the following two definitions.

Refueling stop: This is a specific kind of vehicular stop that occurs when the driver refuels during a stop. Refueling stops are distinct from other stops in that they only occur around gas stations. A refueling is given by rsi = {υsi , gsk } where the j th vehicular stop is identified as the i th refueling stop, and the driver refueled at gas station k.

Casual stop: a stop that is not a refueling stop. The main difference between a casual stop and a refueling stop is that the former may occur near a gas station or not; however, the latter occur at a gas station. A casual stop is given by csi = {υsi }.

With enough prior definitions, we can now state our problem: Sets of vehicular stops extracted from trajectories must be correctly classified as refueling stops or casual stops.

In the following, we base our algorithms and applications on the definitions above.

3 Methodology

3.1 Extracting Prior Knowledge and Constructing the

Training Data Set

When a given a vehicular stop is fed to the classifier, the classifier can correctly identify this stop as either casual or refueling. This is the goal of our algorithm. To optimize the classifier, reliable prior knowledge needs to be retrieved. Essential prior knowledge is factors affecting the classifier parameters. For our data set, these factors can be observed from odometer readings and temporal intervals. With prior knowledge, it is possible to predict how far a car has to travel from the refueling stop before it has to refuel. It should also be possible to predict how long the refueling stop will last.

The predicted distance a car travels from the last refueling stop before refueling is υ, and the time of a refueling stop is τ. These parameters tend to converge to two constants for the same route. The greatest distance a car can drive on a full tank is mainly determined by the load of the car, the capacity of the tank, the driver’s driving habits, highway conditions, and distribution of gas stations. In long-distance transport, refueling behavior is usually unique for the same driver using the same vehicle on the same route. The driver is always optimizing their decisions to gain the maximum benefit. A refueling process comprises pumping the gas and paying. Hence, the time of the whole process only decreases by a relatively small amount.

Because the patterns are known, a reasonable method is required to calculate these variables. Authentic patterns can be mined from data obtained from real refueling stops. We propose a naive method with hill climbing algorithm to approximate these variables. Algorithm 1 is the naive method for a given υ.

Algorithm 1. Naive Bayes (VS ; υ )

Require:

VS = {vs 1, vs 2 , ... , vsn }, a set of vehicular stops, when

vsi = {x , y , odm , ti , gsid , gsdis , lastdis , r i = 0}.

υ , the spatial parameter.

Ensure:

VS = {vs 1, vs 2 , ... , vsn }, where vsi .r i is determined.

RS = {rs 1, rs 2, ... , rs m1}, a set of refueling stops.

CS = {cs 1, cs 2, ... ,cs m 2}, a set of casual stops.

1: lastodm = 0, m 1 = 0, m 2 = 0, RS null, CS null ;

2: for i = 1; i ≤ n ; i ++ do

3: vsi .lastdis = vsi .odm - lastodm;

4: if vsi .lastdis > υ then

5: lastodm = vsi .odm;

6: vsi .ri = 1;

7: rsm1 = vsi ;

8: RS :add (rsm1);

9: m1 = m1 +1;

10: else

11: csm 2 = vsi ;

12: CS.add (csm 2);

13: m2 = m2 +1;

14: end if

15: end for

16: return {VS , RS , CS };

The main drawback of the naive method is that it cannot distinguish between nearby refueling stops and casual stops. However, it can give reliable answers for highly distinguishable events. For example, there is a car which stops consecutively at two gas stations at a distance of more than υ apart and without other stops near them. In this case, we can confidently detect two reliable refueling stops and mine the patterns. Nevertheless, we have to eliminate indistinguishable neighbors of the refueling stops.

3.1.1 Eliminating Indistinguishable Neighbors

For each refueling stop detected using the naive Bayes method, there are neighboring stops that can be either refueling stops or casual stops. These neighboring stops are located at a distance d 0 and time t 0 from the refueling stop. These neighboring stops are indistinguishable, so we proceed to the elimination of the pool of this detection and its neighbors without considering these indistinguishable stops in our hill climbing algorithm. Then, we obtain the remaining set of refueling stops RS remain and casual stops CS remain.

After eliminating unreliable detections, we calculate the distance between the patterns from real data υ using

If ι is small enough, we can confidently say that υ corresponds to authentic patterns. Initially, υ cannot be determined from raw trajectories data set because of the first problem mentioned in section 1. To overcome this problem, we propose a hill-climbing algorithm to approximate the best υ with the lowest possible ι. The hill-climbing algorithm is shown in Algorithm 2.

Algorithm 2. Hill-Climbing ({VS }, υinit , increment )

Require:

{VS } = {VS 1, VS 2, ... , VSn }, a set of trajectory, when

VS i = {vs 1i , vs 2i , ... , vs ni }

υ init , an randomly initial spatial parameter.

increment, a constant stands for the increment in iterations.

Ensure:

υ best , the best spatial parameter.

1: υ best = υ init ;

2: while υcurrent ≠υbest do

3: υcurrent = υ best ;

4: ι 1 DistCal ({VS }, υ current );

5: ι 2 DistCal ({VS }, υ current - increment );

6: ι 3 DistCal ({VS }, υ current + increment );

7: υ best argmin{ι 1, ι 2, ι 3,};

8: end while

9: return υ best ;

The algorithm extracts υ best , which produces the smallest ι from {υ, υ - increment, υ + increment} in line 7. The sub-algorithm DistCal is shown in Algorithm 3. The eliminations process in the algorithm conforms to the definition of eliminating indistinguishable neighbors previously given.

Algorithm 3. DistCal ({VS }, V)

Require:

{VS } = {VS 1, VS 2, ... ,VS n}, a set of trajectory, when

{VS i } = {vs 1i , vs 2i , ... , vs ni }

υ, a given spatial parameter.

Ensure:

ι, the distance between real data and given spatial parameter.

1: ι null ;

2: for i = 1; i ≤ n ; i ++ do

3: {VS i , RS i , CS i } NaiveMethod (VS i , υ );

4: if RS i = null then

5: ι current =|VSi .allodm -υ | ;

6: else

7: {RS , CS }

Eliminations (VS i , RS i , CS i );

8: ι current = |RS . lastdis-υ ;|

9: end if

10: ι:add(ι current );

11: end for

12: return ι ;

The hill-climbing algorithm starts from a randomly initiated spatial parameter υ init from [υ , υ ] and then iterates until υ reaches a local minimum. This algorithm can generate a reliable spatial parameter because of its convergence. The number of vehicular stops in a single trajectory with VSi is given by nmax. Then, regardless of how υ changes, the number of detected refueling stops nr lies in [0, n max]. As υ decreases, nr increases until it reaches n max. However, υ continues to decrease, which

makes increase when (VSi × allodm)/n max

is a constant. The results are the same when υ continually increases. Hence, the best approximated υ can be obtained from the real data set as ι decreases.

3.1.2 Constructing the Training Data Set

When υ best has been determined, we can construct the training data set for the naive Bayes classifier. First, we use the naive method on the real data set with υ = υ best. Then, we eliminate indistinguishable neighbors in the outcome {VS } and obtain the training data set, which comprises {RS remain} and {CS remain}. The set {RS remain} is the set of positive examples in the naive Bayes classifier, and {CS remain} is the set of negative examples in the na?ve Bayes classifier.

3.2 A Naive Bayes Classifier with Multiple Hypotheses

Detecting refueling stops can be thought of as a classification problem. An event is unambiguously classified according to evidence derived from observations. A classifier should be able to correctly identify whether a given vsi is a refueling stop or casual stop by observing the time of the stop and distances traveled from last refueling.

3.2.1 A Naive Bayes Classifier Framework

We use a naive Bayes classification framework [8]-[12], which is based on Bayes’ formula [13], [14] and is given by

where Y is the class variable and X is the attribute for each Xi comprising d attributes [9]. From our observations, X = {x 1, x 2}, where d = 2, x 1 = vs i × lastdis and x 2 = vs i × ti ; and Y = {y 1, y 2}, where y 1 = 0 is a casual stop, and y 1 = 1 is a refueling stop. Then, classification problem can be briefly expressed as vs i ×

ri = argmax (P (Y = yi|X)). However, before using (2), P (Y ),

P (Y|X ) has to be calculated first. In all circumstances, P (X ) is a constant [8], [9].

Given a raw data set, P (Y ) can be directly observed and calculated. However, construction of training data set requires indistinguishable neighbors to be eliminated. Hence, we have to eliminate the effects of data loss. There are np positive examples and nn negative examples, so we know that we have eliminated nv indistinguishable stops, of which there are nr refueling stops that have been detected using the naive Bayes method.

Then, we can calculate and

If we assume Gaussian distributions of continuous attributes x 1, x 2 (section 3.1), the class-conditional probability for any attribute Xi can be expressed as

where μij can be estimated using the sample mean x i of Xi for all training data that belong to the class yj , and σ 2ij can be estimated using the sample variance of the same subset of training data [8].

The naive Bayes classifier is suitable for detecting refueling stops based on similar observations from our data set and the characteristics of the naive Bayes classifier [9].

The naive Bayes Classifier is not sensitive to isolated noise points because the class-conditional probabilities are estimated over all data. A single observation of location data may be erroneous due to errors in the GPS receivers, odometer, and other sensors. However, these noises are eliminated by averaging out.

The naive Bayes classifier is not sensitive to irrelevant attributes. In our data set, the distance from last refueling and the length of time spent at a stop are only loosely related.

3.2.2 Multiple-Hypotheses Framework

A naive Bayes classifier does not yet solve the problem of indistinguishable neighbors. We propose a framework that combines multiple hypotheses with a naive Bayes classifier to overcome this problem.

For a set of stops and neighboring stops (including both refueling and casual stops) within a distance of d 0 and time t 0, it needs to be determined whether there is only refueling stop in a set or whether there is no refueling stop in a set.

For a set of neighboring stops VS of sizes m , there are m +1 possible results from the following equation:

where k = 0 indicates that none of m is a refueling stop, and k ≠ 0 indicates that the k th vehicular stop is a refueling stop. We solve the indistinguishable neighbor problem by calculating all possible situations in (4) and determining the highest probability by using argmax{P ′}1×(m +1).

Algorithm 4 the Naive Bayes Classifier and Multiple Hypothesis Combined Framework

Require:

VS = {υs 1, υs 2, ... , υsn }, when υsi =

{x, y, odm, ti, gsid, gsdis, lastdis, ri = 0}.

{μ ij }, {σ ij }, the set of parameters from the training data.

d 0, t 0, two parameters to determine neighbors.

Ensure:

S = {υs 1, υs 2, ... , υsn }, when υsi .ri is determined.

1: for i = 1; i ≤n ; i ++ do

2: NeighborSet i1×m GetNeighborSet (υsi );

3: if NeighborSet i1×m = null then

4: υsi.ri = argmax (P (Y =yj|X ));

5: else

6: k = argmax{P ′}1×(m+1);

7: for j = 0; j < m; j ++ do

8: if j +1 ≠ k then

9: υsi +j .ri = y 1 = 0;

10: else

11: υsi +j .ri = y 2 = 1;

12: end if

13: end for

14: i =i + m ;

15: end if

16: end for

17: return VS;

4 The Application

In this section, we describe an application that uses the previously mentioned algorithms to output reliable indications of refueling behavior. Combining gas station background information and collective intelligence derived from drivers, the application, called refueling prediction and recommendation, provides optimized refueling choices. The application is straightforward. With well-designed logical models, we show how refueling behavior can be determined by mining driver intelligence from freight trajectory data.

Given real-time information about the car, and given background information from other sources, the application tells a driver when and where they should refuel.

The real-time information of a car changes when the car needs refueling. This information includes the distance from where the car last refueled. This attribute is exactly the same as that used in the above algorithms. However, here it is monitored in real time in order to provide instant evidence for our recommendation. The real-time spatial attribute of a specific time stamp t is dt . The statistical spatial constant parameters μc and σc are the successions of the definition in section 3.2.1. These constant parameters can be computed from the set of determined refueling stops by using the above algorithms.

Background information can also dominate the recommendation. Background information has two major parameters which represent the popularity of one gas station and the distance of one gas station from the real-time location of one car. The popularity of one gas station can be obtained by using the Naive Bayes Classifier with Multiple Hypothesis Algorithm on a massive data set and then collecting the results. It emits a possibility show that how likely a refueling will happen on one specific gas station. And the distance of one gas station from the instant location of one car can be used to estimate how likely the car will take a refueling action when it arrives at this gas station, in other words, refueling prediction. The popularity of a gas station i and the distance between this gas station and the car’s instant location at time t are denoted by Pgi and dti , respectively.

By integrating real-time and background information, we can model the problem in this prediction and recommendation system. Given dt of a car at time t, what is the probability P tgi that the car will refuel at gas station i with background information Pgi and dti ? This probability can be computed using

In order to recommend a specific gas station to this car at time t, we can compute (5) for all gas stations and recommend the gas station with the highest probability argmax{Ptgi }. In this real application, we construct the set of candidate gas stations with direction and distance thresholds in order to eliminate invalid candidates for efficiently computing (5).

5 System Framework

Our framework of a refueling detection and recommendation/prediction system comprises data preprocessor, knowledge discovery, and intelligence generator (Fig. 1). This design is general and scalable. We embed our refueling detector and recommendation/prediction into the framework.

The data preprocessor contains a spout and bolts in a Storm cluster that is especially designed to preprocess data from spatial trajectories. The spout continually emits GPS measurements; however, the level of the bolts is determined by logical complexity of the preprocessing tasks. In our system, these bolts clean unqualified GPS records, reorganize the trajectory, and extract vehicular stops. This module constantly receives GPS records and transforms them into trajectories, and the last bolts extract vehicular stops along each trajectory. The output of the last bolts is fed to the next knowledge-discovery module.

The knowledge-discovery module is built on the Hadoop batch-processing platform. This module is mainly responsible for carrying out the most computationally expensive tasks in the system. The batch-processing platform runs these tasks routinely, or only in certain conditions, at low frequency and with latency, unlike the real-time data preprocessor and intelligence generator components. In our case, the naive Bayes classifier with multiple hypotheses algorithm is implemented in MapReduce and run in the Hadoop clusters. The output of algorithms is stored in database systems as knowledge and experience extracted from big data.

The intelligence generator comprises real-time applications that interact with users. It also comprises a spout and bolts, which are located inside the Storm cluster. The intelligence generator and data preprocessor may even share the same spout. However, the data preprocessor digests all the data from the source while the intelligence generator extracts and purifies real-time information from the data source with some of bolts to handle data selections and transformations. Further, after real-time information is captured, the application in a bolt is connected to database systems where knowledge and experience are stored. By combing real-time information and background knowledge, applications in the intelligence generator can accurately provide recommendations in real time. In our case, the bolts in the intelligence generator monitor real-time information for each car, and our recommendation/prediction algorithm is applied to every bolt. Then, a recommendation is made as to whether the driver should refuel.

These three components are highly integrated and are all built on scalable, fault-tolerant distributed systems on Hadoop and Storm platforms. Comprehensive use of these platforms means our system can provide both real-time services and complex data-mining services. Additional applications can be easily integrated into the existing framework in the form of Hadoop MapReduce jobs or Storm bolts, taking the advantage of reusable resources.

6 Performance Evaluation

6.1 Data Descriptions

The GPS data set was collected from 14,800 cars involved in the logistics and freight industries in China. Each car constantly reported its location about every 15 s. The data was collected from October 2009 to April 2012, and real-time data is still incrementing by several Gigabytes every day. Tables 1 shows the data information of the dataset, Tables 2 shows the fields in each record.

Because of the variety of freight routes, our trajectory data set covers most of China. Fig. 2 shows an example route from our data set. This route originates in Shenzhen and ends in Shanghai. The driving distance is about 2000 km.

Besides, we have location information of gas stations all round the country. These massive data of gas stations contain location records of 114,000 stations in China. The number of gas stations matches our knowledge of the number of all gas stations in China. Fig. 3 shows the spatial distributions of all these gas stations.

6.2 Experiment Settings

In the initial part of our experiment, we selected a part of our data set of 1000 cars. This raw data was 12 GB. First, we input this data into the data preprocessor module and collected the output of several vehicular stops from different routes. Second, we ran our Naive Bayes Classifier with Multiple Hypothesis algorithm with υinit = 700. Then, we obtained the result of a set of determined refueling stops with spatial statistical parameters μc and σc. Third, we randomly chose 3000 instant records from different cars and ran our prediction and recommendation application. The input for was instant information from these cars and background information generated in previous procedures. By the end of the three stages, we had obtained 3000 instant records from different cars and routes and recommended a gas station for each car.

6.3 Performance Evaluation

Given the 3000 instant records and 3000 gas stations generated by our application, we designed an indicator called drift to measure the differences between ground truth and assigned recommendations. For the i th instant record, gr (i ) is the recommended gas station, and gt (i ) is the actual gas station where the car refueled. Then, drifti can be obtained for the ith instant record:

The distance (gr (i ), gt (i )) function returns driving distances, recommended gas station gr (i ), and actual gas station gt (i ), which are calculated from historical trajectories instead of straight-line distances.

Fig. 4 shows the cumulative distribution function of drifti measurements for all 3000 samples. This result shows that our recommendation/prediction system is precise, and nearby alternative gas stations can be accurately recommended for refueling. More than 50% of our recommendations were close to the actual gas stations with no drifts. More than 60% of the recommendations drifted within 100 kilometers. Almost 85% of recommendations drifted within 300 kilometers. Assuming an average speed of 120 km/h on highway in China, we provided 85% of users with alternative refueling gas stations within a driving time of around 2.5 hours. This is acceptable because the assigned alternative gas stations are always more attractive to drivers because they provide better services.

However, 15% of recommendations were far away from the actual refueling spots. These should be considered failed recommendations. Possible causes of failed recommendations are unavoidable random corruption in raw GPS records and insufficient experimental data. Although some unqualified records could be removed in the data-cleaning process, defective odometer readings and abrupt discontinuance of trajectories affect the precision of the refueling detector. This may have a detrimental impact on our recommendation system. A partial data set in the big data may cause bias in when calculating the popularity of gas stations. This leads to a lack of equilibrium in the statistical distribution. Accuracy can be improved by dropping inaccurate raw data and adding more reliable raw data in trade of larger system scale.

7 Related Work

Much previous research has been centered on obtaining knowledge from large numbers of trajectories and has resulted in the creation of route recommendation systems and traffic prediction systems [6], [7]. Research on refueling behavior has also been done [1], [3]-[5], [8], [10], [11], [13].

In [1], refueling behavior of gasoline-car drivers was analyzed, and it was found that refueling behavior is strongly correlated to characteristics how a car is driven. This research spurred us to use algorithms that detect refueling stops based on the driving distance. Analysis of refuel availability in [3], [4] model of optimization of refueling stations locating in [5] presented research on spatial distribution of gasoline stations and quantified refuel availability in order to optimizing the locating of gasoline stations. Our refueling recommendation/prediction system evaluates refueling availability with spatial distributions of gasoline stations combined with their popularity among all drivers. However, such previous research work did not use a huge amount of trajectories data, which means that these research results limited in a small range both spatially and temporally. In comparison, our system is based on a big volume of real trajectories data. Moreover, we introduced the classification methodologies for data mining [8]-[11] into our algorithms to achieve high accuracy under the scope of data uncertainty.

A smart route service system for taxicab drivers based on a large amount of historical taxi trajectories was proposed in [6], as well as a graph model to estimate traffic conditions. Our research work shared the same designing ideas with [6] in the data preprocessing model. However, we further integrated batch processing platform Hadoop and real-time stream processing platform to process big data for real-time applications. Our architecture and framework have a systematically advantage in generality and scalability.

8 Conclusion

In this paper, we have integrated machine-learning and data-mining techniques to create a refueling detection algorithm called Naive Bayes Classifier with Multiple Hypotheses. This algorithm detects whether a car has refueled. We have also designed a practical application to recommend refueling alternatives. This application is based on real-time information taken from a vehicular trajectory and background knowledge about the distribution of gas stations and driver experience. Furthermore, we have developed a general framework for integrating Storm real-time processing platform and Hadoop batch-processing platform in order to process and data mine the trajectory under the scope of real-time big data. Our experiment shows that, our recommendation/prediction system recommends acceptable gas station alternatives with low drifts.

Acknowledgment

This work was supported by a grant from the Science Technology and Innovation Committee of Shenzhen Municipality.

References

[1] R. Kitamura and D. Sperling. “Refueling behavior of automobile drivers,” Transportation Research Part A: General, vol. 21, no. 3, pp. 235-245, 1987.

[2] M.A. Nicholas, “Driving demand: What can gasoline refueling patterns tell us about planning an alternative fuel network?” in J. Transport Geography, vol. 18, no. 6, pp. 738-749, 2010.

[3] M. Melaina and J. Bremson, “Refueling availability for alternative fuel vehicle markets: Sufficient urban station coverage,” in Energy Policy, vol. 36, no. 8, pp. 3233-3241, 2008.

[4] D. L. Greene. “Survey evidence on the importance of fuel availability to the choice of alternative fuels and vehicles,” in Energy Studies Review, vol. 8, no. 3, p. 2 1998.

[5] Y. W. Wang and C. C. Lin, “Locating road-vehicle refueling stations,” Transportation Research Part E: Logistics and Transportation Review, vol. 45. no. 5, pp. 821-829, 2009.

[6] J. Yuan, Y. Zheng, L. Zhang, X.I. Xie, and G. Sun. “Where to find my next passenger” in Proc. 13th ACM Int. Conf. on Ubiquitous Comput. (UbiComp’11), 2011.

[7] J. Yuan, Y. Zheng, C. Zhang, X. Xie, and G.Z. Sun, “An interactive-voting based map matching algorithm,” in Proc. 11th Int. Conf. on Mobile Data Management (MDM’10), pp. 43-52.

[8] J. Han and M. Kamber. Data mining: concepts and techniques. Morgan Kaufmann Publishers, 2006.

[9] P. N. Tan et al. Introduction to data mining. Pearson Education India, 2007.

[10] P. Langley, W. Iba, and K. Thompson, “An analysis of Bayesian classifiers,” in Proc. Nat. Conf. on Artificial Intelligence, 1992, pp. 223-223.

[11] P. Domingos and M. Pazzani, “On the optimality of the simple Bayesian classifier under zero-one loss,” Machine Learning, vol. 29, no. 2, pp: 103-130, 1997.

[12] M. Ramoni and P. Sebastiani, “Robust Bayes classifiers,” Artificial Intelligence, vol. 125, no. 1, pp: 209-226, 2001.

[13] R.O. Duda, P.E. Hart, and D.G. Stork. Pattern classification and scene analysis 2nd ed. New York: Wiley, 1995.

[14] R. Sheldon et al. A first course in probability. Pearson Education India, 2002.

Manuscript received: May 18, 2013

上一篇:家兔饲养管理基本技术 下一篇:中学美术教学原则探讨